(particularly systems implementations), but little work has been done on formally studying the
expressiveness of the formalisms proposed or on the theoretical foundations of wrapping. In this 
paper, we first study monadic datalog over trees as a wrapping language. We show that this simple 
language is equivalent to monadic second order logic (MSO) in its ability to specify wrappers. We 
believe that MSO has the right expressiveness required for Web information extraction and propose 
MSO as a yardstick for evaluating and comparing wrappers. Along the way, several other results 
on the complexity of query evaluation and query containment for monadic datalog over trees are 
established, and a simple normal form for this language is presented. Using the above results, we 
subsequently study the kernel fragment Elog$^-$ of the Elog wrapping language used in the Lixto
system (a visual wrapper generator). Curiously, Elog$^-$ exactly captures MSO, yet is easier to use.
Indeed, programs in this language can be entirely visually specified. 

Categories and Subject Descriptors: F.1.1 [Computation by Abstract Devices]: Automata;
F.4.1 [Mathematical Logic and Formal Languages]: Computational Logic; F.4.3
[Mathematical Logic and Formal Languages]: Classes defined by grammars or automata; H.2.3
[Database Management]: Query languages; I.7.2 [Document Preparation]: Markup
languages

General Terms: Theory, Languages, Algorithms 

Additional Key Words and Phrases: Complexity, Expressiveness, HTML, Information Extraction, 
Monadic Datalog, MSO, Regular Tree Languages, Web Wrapping 


1. INTRODUCTION 

The Web wrapping problem, i.e., the problem of extracting structured information 
from HTML documents, is one of high practical importance and has spurred a great 
amount of work, including theoretical research (e.g., [Atzeni and Mecca 1997]) as 
well as systems. Previous work can be classified into two categories, depending 
on whether the HTML input is regarded as a sequential character string (e.g., 
TSIMMIS [Papakonstantinou et al. 1995], Editor [Atzeni and Mecca 1997], FLORID 
[Ludäscher et al. 1998], and DEByE [Laender et al. 2002]) or a pre-parsed document
tree (for instance, W4F [Sahuguet and Azavant 2001], XWrap [Liu et al. 2000], and 
Lixto [Baumgartner et al. 2001b; 2001a; Lixto ]). The latter category of work thus 
assumes that systems may make use of an existing HTML parser as a front end. 
From the standpoint of theory, many practical problems are presumably simpler
to solve over the parse trees of documents than over the documents
considered as strings.$^1$ In light of the large legacy of Web documents that
motivates Web information extraction in the first place, the practical perspective of
tree-based wrapping must be emphasized. Robust wrappers are easier to program
using a wrapper programming language that models documents as pre-parsed
document trees rather than as text strings. Writing a fully standards-compliant HTML
parser is a substantial task, which should not have to be redone from scratch for
each wrapper being created. The use of an existing parser allows the wrapper
implementor to focus on the essentials of each wrapping task and to work on a higher,
more user-friendly level. No serious study of the productivity gains obtained by the
transition from string-based to tree-based wrapping has been conducted as of yet,
but we think that it is clear that the leap in productivity must be substantial.

Nonlinear productivity improvements in software development are among the
most desirable and valuable outcomes of computer science research. The often-observed
information overload that users of the Web experience bears witness to the lack
of intelligent and encompassing Web services that provide high-quality collected and
value-added information. At the origin of this, there is a mild form of software crisis
in Web information extraction which calls for such productivity improvements.

A second candidate for a substantial productivity leap, which in practice requires 
the first (tree-based representation of the source documents) as a prerequisite, is the 
visual specification of wrappers. By visual wrapper specification, we ideally mean
the process of interactively defining a wrapper from one (or few) example
document(s) using mainly “mouse clicks”, supported by a strong and intuitive design
metaphor. During this visual process, the wrapper program should be automatically
generated and should not actually require the human designer to use or even
know the wrapper programming language. Visual wrapping is now a reality
supported by several implemented systems [Liu et al. 2000; Sahuguet and Azavant
2001; Baumgartner et al. 2001a], albeit with varying thoroughness.

Little is known about the theoretical aspects of tree-based wrapping languages. 
Clearly, languages which do not have the right expressive power and computational 
properties cannot be considered satisfactory, even if wrappers are easy to define. 
One may thus want to look for a wrapping language over document trees that 

(i) has a solid and well understood theoretical foundation, 

(ii) provides a good trade-off between complexity and the number of practical
wrappers that can be expressed,

(iii) is easy to use as a wrapper programming language, and 

(iv) is suitable for being incorporated into visual tools, since ideally all constructs 
of a wrapping language can be realized through corresponding visual primitives. 

This paper exhibits and studies such languages. 

It is understood in the literature that the scope of wrapping is a conceptually
limited one. Information systems architectures that employ wrapping usually
consist of at least two layers: a lower one that is restricted to extracting relevant data
from data sources and making them available in a coherent representation using
the data model supported by the higher layer, and a higher layer in which the data
transformation and integration tasks are performed that are necessary to fuse
syntactically coherent data from distinct sources in a semantically coherent
manner. With the term wrapping we refer to the lower, syntactic integration layer. The
higher, semantic integration layer is not a topic of this paper. Therefore, a wrapper
is assumed to extract relevant data from a possibly poorly structured source and
to put it into the desired representation formalism by applying a number of
transformational changes close to the minimum possible. A wrapping language that
permits arbitrary data transformations may be considered overkill.

1 In fact, it is known that a word language is context-free iff it is the yield of a regular tree language
(cf. [Gecseg and Steinby 1997]), where the yield of a tree is the sequence of labels of its leaf nodes
extracted depth-first from left to right.

The core notion that we base our wrapping approach on is that of an information
extraction function, which takes a labeled unranked tree (representing a Web
document) and returns a subset of its nodes. In the context of the present paper, a
wrapper is a program which implements one or several such functions, and thereby
assigns unary predicates to document tree nodes. Based on these predicate
assignments and the structure of the input tree, a new tree can be computed as the result
of the information extraction process in a natural way, along the lines of the input
tree but using the new labels and omitting nodes that have not been relabeled.

That way, we can take a tree, re-label its nodes, and declare some of them 
as irrelevant, but we cannot significantly transform its original structure. This 
coincides with the intuition that a wrapper may change the presentation of relevant 
information, its packaging or data model (which does not apply in the case of Web
wrapping), but does not handle substantial data transformation tasks. We believe
that this captures exactly the essence of wrapping. 
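
The construction of the output tree from a predicate assignment can be sketched in a few lines of Python. This is only an illustration under assumptions of our own: the `Node` class, the `extract` function, and the choice to attach each kept node to its nearest kept ancestor are hypothetical readings of "in a natural way" and are not taken from the formal development.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """A node of a labeled unranked ordered tree (hypothetical encoding)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def extract(node: Node, new_label: Dict[int, str]) -> List[Node]:
    """Return the output subtrees obtained from the subtree rooted at `node`.

    `new_label` maps id(node) to the unary predicate assigned by the wrapper.
    A relabeled node is kept under its new label. A node without a new label
    is omitted, and its surviving descendants are handed to the nearest kept
    ancestor, preserving their left-to-right order.
    """
    kept_below: List[Node] = []
    for child in node.children:
        kept_below.extend(extract(child, new_label))
    if id(node) in new_label:
        return [Node(new_label[id(node)], kept_below)]
    return kept_below


# A small document: body(h1, table(tr(td, td))).
td1, td2 = Node("td"), Node("td")
table = Node("table", [Node("tr", [td1, td2])])
body = Node("body", [Node("h1"), table])

# The wrapper assigns three predicates; all other nodes are dropped.
assignment = {id(table): "record", id(td1): "name", id(td2): "price"}
output = extract(body, assignment)
```

The result keeps the relative nesting and order of the relabeled nodes but cannot reorder or restructure them, which matches the restriction discussed above.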

We propose unary queries in monadic second-order logic (MSO) over unranked
trees as an expressiveness yardstick for information extraction functions. MSO over
trees is well-understood theory-wise [Thatcher and Wright 1968; Doner 1970;
Courcelle 1990; Flum et al. 2001] (see also [Thomas 1990; 1997]) and quite expressive.
The MSO query evaluation problem is PSPACE-complete (combined complexity).
The parameter of most significant influence in query evaluation is of course the size
of the data. Unary MSO queries can be evaluated in linear time with respect to
the size of the input tree [Flum et al. 2001; Courcelle 1990] using techniques that,
unfortunately, have nonelementary complexity in terms of the size of the MSO
query.$^2$ Thus, even when assuming the size of a wrapper program (as a set of
MSO formulae) to be small, we cannot consider requirement (ii) satisfied.
Moreover, MSO does not satisfy requirements (iii) and (iv): it is neither easy to
use as a wrapping language nor does it lend itself to visual specification.

Presently, only two formalisms are known that precisely capture the unary MSO
queries over trees yet are computationally cheaper to process: query automata
[Neven and Schwentick 2002], a form of deterministic two-way tree automata with
a selection function, and boolean attribute grammars [Neven and van den Bussche
2002]. At least the latter formalism satisfies requirement (ii): boolean attribute
grammars can be evaluated efficiently both in terms of the size of the data and the
query. However, we think that neither satisfies requirement (iii) or (iv).


2 This is at least so under widely held complexity-theoretic assumptions [Frick and Grohe 2002]. 
The main task of practical Web information extraction is the detection and
extraction of interesting “objects” from a Web document. Modeling such objects in
a wrapper often only requires a small fraction of the intuitive “complexity” of the
full documents to be wrapped. However, both query automata and attribute
grammars require modeling the entire source documents, which may be substantially
more cumbersome and work-intensive than just describing the objects of interest.
Such a monolithic approach is very brittle in real-world applications where no full
model of the source documents is available or their layouts change frequently. In
contrast, all implemented practical systems for tree-based wrapping that we are
aware of [Liu et al. 2000; Sahuguet and Azavant 2001; Baumgartner et al. 2001b]
are based on wrapping languages that allow the objects of interest to be specified
without requiring a model of the entire source documents.$^3$

It is also worth mentioning that both query automata and boolean attribute 
grammars cause substantial notational difficulty on unranked trees, which makes
them difficult to use on Web documents. 

The main contributions of the paper are the following. 

—We study monadic datalog and show that it is equivalent to MSO in its ability 
to express unary queries for tree nodes (in ranked as well as unranked trees). 
We also characterize the evaluation complexity of our language. We show that 
monadic datalog can be evaluated in linear time both in the size of the data and 
the query, given that tree structures are appropriately represented. Interestingly, 
judging from our experience with the Lixto system, real-world wrappers written 
in monadic datalog are small. Thus, in practice, we do not trade the lowered 
query complexity compared to MSO for considerably expanded program sizes. 
Monadic datalog over labeled trees is a very simple programming language and 
much better suited as a wrapping language than MSO. Consequently, monadic 
datalog satisfies the first three of our requirements. 

—We provide reductions from query automata (in both the ranked and unranked 
tree case) to monadic datalog. As a corollary we obtain the result that the 
containment problem for monadic datalog over trees remains EXPTIME-hard.
(It is known to be EXPTIME-hard over arbitrary finite structures [Cosmadakis 
et al. 1988].) This is also a demonstration of how conveniently even intricate 
automaton constructions can be simulated in our language of choice. 

Moreover, we show that monadic datalog is a more efficient device for evaluating 
queries defined by query automata than query automata themselves: while there 
are terminating runs of query automata that take superpolynomially many steps, 
the same queries are evaluated in time linear in the size of the data and quadratic 
in the size of the query automata using our reductions to monadic datalog. 

—We define a simple normal form for monadic datalog over trees, TMNF, to which 
any monadic datalog program over trees can be mapped in linear time. 

—Finally, we present a simple but practical Web wrapping language equivalent to 
MSO, which we call Elog$^-$. Elog$^-$ is a simplified version of the core wrapping
language of the Lixto system, Elog (“Extraction by datalog”), and can be obtained
by slightly restricting the syntax of monadic datalog. Programs of this language
(even recursive ones) can be completely visually specified, without requiring the
wrapper implementor to deal with Elog$^-$ programs directly or to know datalog.
We also give a brief overview of this visual specification process. Thus, Elog$^-$
satisfies all four of our desiderata for tree-based wrapping languages.

3 We admit that attribute grammars are an elegant formalism for extracting relations from trees.
That problem is not a topic of this paper. Here, we hope to improve on the state of the art of
extracting (e.g. XML) trees from documents with a wrapping formalism that is more manageable.

The present work is, to the best of our knowledge, the first to provide a
theoretical study of an advanced tree-based wrapping tool and language used in an
implemented system. In summary, we present a thorough theoretical analysis of
expressiveness aspects of tree-based information extraction based on the
expressiveness of MSO as an intuitively justifiable yardstick for languages attacking this
problem. We also keep the efficiency of query evaluation in mind and are able to
guarantee linear-time evaluation for the language studied.

The paper is structured as follows. We start with preliminaries regarding trees
and MSO in Section 2 and introduce monadic datalog in Section 3.1. In Section 3.2,
we present several known theoretical results on (monadic) datalog. The main
technical developments of this paper start with Section 4. The complexity of monadic
datalog over trees is detailed in Section 4.1, its expressive power in Section 4.2, and
the relationship to query automata is studied in Section 4.3. Section 5 presents the
transformation of monadic datalog over trees into the normal form TMNF. In
Section 6, we define the Elog$^-$ fragment of the industrial-strength Elog language
and study its theoretical properties. We conclude with Section 7.

2. PRELIMINARIES 

Throughout this paper, only finite trees will be considered. Trees are defined in the
normal way and have at least one node. We assume that the children of each node
are in some fixed order. Each node has a label taken from a finite$^4$ nonempty set
of symbols $\Sigma$, the alphabet. We consider both ranked and unranked trees. Ranked
trees have a ranked alphabet, i.e., each symbol in $\Sigma$ has some fixed arity or rank
$k \le K$ (and $K$ is the maximum rank in $\Sigma$, i.e. a constant integer). We may partition
$\Sigma$ into sets $\Sigma_0, \ldots, \Sigma_K$ of symbols of equal rank. A node with a label $a \in \Sigma_k$ (i.e.,
of rank $k$) has exactly $k$ children. Nodes with labels of rank 0 are called leaves.
Each ranked tree can be considered as a relational structure

$t_{rk} = (\mathrm{dom}, \mathrm{root}, \mathrm{leaf}, (\mathrm{child}_k)_{k \le K}, (\mathrm{label}_a)_{a \in \Sigma})$.

In an unranked tree, each node may have an arbitrary number of children. An
unranked ordered tree can be considered as a structure

$t_{ur} = (\mathrm{dom}, \mathrm{root}, \mathrm{leaf}, (\mathrm{label}_a)_{a \in \Sigma}, \mathrm{firstchild}, \mathrm{nextsibling}, \mathrm{lastsibling})$

where “dom” is the set of nodes in the tree, “root”, “leaf”, “lastsibling”, and
the “label$_a$” relations are unary, and “firstchild”, “nextsibling”, and the “child$_k$”
relations are binary. All relations are defined according to their intuitive
meanings. “root” contains exactly one node, the root node, “leaf” consists of the
set of all leaves, and child$_k$ denotes the $k$-th direct child relation in a ranked tree.


4 The finite alphabet choice is discussed in more detail below, in Remark 2.2. 
In unranked trees, “firstchild($n_1$, $n_2$)” is true iff $n_2$ is the leftmost child of $n_1$;
“nextsibling($n_1$, $n_2$)” is true iff, for some $i$, $n_1$ and $n_2$ are the $i$-th and $(i+1)$-th
children of a common parent node, respectively, counting from the left (see also
Figure 1). label$_a$($n$) is true iff $n$ is labeled $a$ in the tree. Finally, “lastsibling”
contains the set of rightmost children of nodes. (The root node is not a last sibling,
as it has no parent.) Whenever the structure $t$ may not be clear from the context,
we state it as a subscript of the relation names (as e.g. in dom$_t$, root$_t$, ...).

By default, we will always assume ranked and unranked trees to be represented
using the schemata outlined above, and will refer to them as $\tau_{rk}$ (for ranked trees)
and $\tau_{ur}$ (for unranked trees), respectively.
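
To make the unranked schema concrete, the following Python sketch derives its relations from a nested encoding of a tree. The nested `(label, children)` encoding, the function name `unranked_structure`, the `label_<a>` key naming, and the numbering of nodes in preorder are all assumptions made for this illustration.

```python
from typing import Dict, Tuple

# Hypothetical encoding: a tree is a pair (label, list of child trees).
Tree = Tuple[str, list]


def unranked_structure(tree: Tree) -> Dict[str, set]:
    """Build the relations of the unranked tree schema for `tree`.

    Nodes are identified by the integers 1, 2, ... in preorder. Unary
    relations are sets of node ids, binary relations are sets of pairs.
    """
    rel: Dict[str, set] = {
        "dom": set(), "root": set(), "leaf": set(), "lastsibling": set(),
        "firstchild": set(), "nextsibling": set(),
    }
    counter = 0

    def visit(node: Tree) -> int:
        nonlocal counter
        counter += 1
        me = counter
        label, children = node
        rel["dom"].add(me)
        rel.setdefault("label_" + label, set()).add(me)
        if not children:
            rel["leaf"].add(me)
        previous = None
        for child in children:
            child_id = visit(child)
            if previous is None:
                rel["firstchild"].add((me, child_id))      # leftmost child
            else:
                rel["nextsibling"].add((previous, child_id))
            previous = child_id
        if previous is not None:
            rel["lastsibling"].add(previous)               # rightmost child
        return me

    rel["root"].add(visit(tree))
    return rel


# A six-node tree: the root has three children, the second of which
# has two children of its own. All nodes are labeled "a".
example = ("a", [("a", []), ("a", [("a", []), ("a", [])]), ("a", [])])
structure = unranked_structure(example)
```

Note that the root is never added to "lastsibling", in line with the remark that the root has no parent.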

Monadic second-order logic (MSO) over trees is a second-order logical language
consisting of (1) individual variables (with lower-case names $x, y, \ldots$) ranging over
nodes, also called node variables, (2) set variables (written using upper-case names
$P, Q, \ldots$) ranging over sets of nodes, (3) parentheses, (4) boolean connectives $\lor$
and $\lnot$, (5) quantifiers $\forall$ and $\exists$ over both node and set variables, (6) the relation
symbols of the tree structure in consideration, $=$ (equality of node variables), and,
as syntactic sugaring, possibly (7) the boolean operations $\land$, $\to$, and $\leftrightarrow$ and the
relation symbols $=$ and $\subseteq$ between sets. $\Pi_1$-MSO refers to (universal) MSO sentences
of the form $(\forall P_1) \cdots (\forall P_k)\, \psi(P_1, \ldots, P_k)$ where the $P_i$ are set variables and $\psi$ is a
first-order formula. Given an MSO formula $\varphi$, its quantifier rank $k$ is defined as the
maximum degree of nesting of first-order as well as set-quantifiers in $\varphi$. In other
words, $k$ is the maximum number of quantifiers encountered on any path from the
root of the expression tree of $\varphi$ to a leaf. A unary MSO query is defined by an
MSO formula $\varphi$ with one free first-order variable. Given a tree $t$, it evaluates to
the set of nodes $\{x \in \mathrm{dom} \mid t \models \varphi(x)\}$. A tree language $\mathcal{L}$ is definable in MSO iff
there is an MSO sentence $\varphi$ over tree structures $t$ such that $\mathcal{L} = \{t \mid t \models \varphi\}$.

The regular tree languages (for ranked as well as for unranked alphabets) are
precisely those tree languages recognizable by a number of natural forms of finite
automata [Brüggemann-Klein et al. 2001]. The following is a classical result for
ranked trees [Thatcher and Wright 1968; Doner 1970], which has been shown in
[Neven and Schwentick 2002] to hold for unranked trees as well.

Proposition 2.1. A tree language is regular iff it is definable in MSO. 

Remark 2.2. In the context of wrapping HTML documents, it is worthwhile
to consider an infinite alphabet $\Sigma$, which allows both HTML tags and
attribute assignments to be merged into labels. This requires a generalized notion of relational
structures $(\mathrm{dom}, R_1, R_2, R_3, \ldots)$ consisting of a countable (but possibly infinite)
set of relations, of which only a finite number is nonempty. Even though all
results cited or shown in this paper (such as Proposition 2.1) were proven for finite
alphabets, it is trivial to see that they also hold for infinite alphabets in case the
symbols of the alphabet (i.e., the node labels) are not part of the domain, labels
of domain elements are expressed via predicates such as label$_a$ only (rather than,
say, a binary relation label $\subseteq$ dom $\times$ $\Sigma$), and for each predicate label$_a$ we can also
use its complement $\overline{\mathrm{label}_a}$ (in the finite-alphabet case such a complement can be
obtained by the union $\bigcup_{l \in (\Sigma - \{a\})} \mathrm{label}_l$). Given these requirements, it is impossible
to quantify over symbols of $\Sigma$ and any query in finitary logical languages can only
refer to a finite number of symbols of the alphabet $\Sigma$. (See the related discussion
in [Neven and Schwentick 2000].) Another way to cope with composite tags and
attribute values is to encode such values as lists of character symbols modeled as
subtrees in our document tree. Whatever way is preferred, it should be clear that
the assumption of a finite alphabet $\Sigma$ made in this paper is not a true limitation
for representing real-world documents. □

Fig. 1. (a) An unranked tree and (b) its representation using the binary relations “firstchild” (↙)
and “nextsibling” (↘).


A regular path expression (cf. [Abiteboul and Vianu 1999]) over a set of binary
relations $\Gamma$ is a regular expression (using concatenation “.”, the Kleene star “*”,
and disjunction “|”) over alphabet $\Gamma$. Caterpillar expressions (cf. [Brüggemann-Klein
and Wood 2000]) furthermore support inversion (i.e. expressions of the form
$E^{-1}$, where $E$ is a caterpillar expression)$^5$ and unary relations in $\Gamma$. Caterpillar
expressions only consisting of a single relation name from $\Gamma$ are subsequently called
atomic, all other caterpillar expressions are called compound. Each caterpillar
expression $E$ is inductively interpreted as a binary relation $[\![E]\!]$ as follows.

$[\![E]\!] := E$ ... $E \in \Gamma$ is binary

$[\![E]\!] := \{(x, x) \mid x \in E\}$ ... $E \in \Gamma$ is unary

$[\![E_1.E_2]\!] := \{(x, z) \mid (\exists y)\, (x, y) \in [\![E_1]\!] \land (y, z) \in [\![E_2]\!]\}$

$[\![E_1 \cup E_2]\!] := [\![E_1]\!] \cup [\![E_2]\!]$

$[\![E^*]\!] :=$ the reflexive and transitive closure of $[\![E]\!]$

$[\![E^{-1}]\!] := \{(y, x) \mid (x, y) \in [\![E]\!]\}$

The precedence of operations is such that $E_1 \cup E_2.E_3^*.E_4^{-1}$ can be used as a
shorthand for $E_1 \cup (E_2.(E_3^*).(E_4^{-1}))$. $E^+$ is a shortcut for $E.E^*$. In the following, we
identify the relation $[\![E]\!]$ with the expression $E$ whenever no confusion may occur.

Proposition 2.3. For caterpillar expressions $E$ and $F$,

$(E.F)^{-1} = F^{-1}.E^{-1}$, $(E \cup F)^{-1} = E^{-1} \cup F^{-1}$,
$(E^*)^{-1} = (E^{-1})^*$, $(E^{-1})^{-1} = E$.

5 In [Brüggemann-Klein and Wood 2000] the inverse is only supported on atomic expressions, i.e.
relations from $\Gamma$. We do not assume this restriction, but this is an inessential difference.

Using Proposition 2.3, we can efficiently “push down” inversion operations to the
atomic expressions.

Proposition 2.4. Each caterpillar expression $E$ over $\Gamma$ can be rewritten into
an equivalent ${}^{-1}$-free caterpillar expression over $\Gamma \cup \{R^{-1} \mid R \in \Gamma\}$ in time $O(|E|)$.

Example 2.5. The document order relation $\prec$ is a natural total ordering of
dom used in several XML-related standards (see e.g. [World Wide Web Consortium
1999]). It is defined as the order in which the opening tags of document tree nodes
are first reached when reading an HTML or XML document (as a flat text file)
from left to right. For an example, consider the document

$\langle a\rangle\ \langle a\rangle\ \langle /a\rangle\ \langle a\rangle\ \langle a\rangle\ \langle /a\rangle\ \langle a\rangle\ \langle /a\rangle\ \langle /a\rangle\ \langle a\rangle\ \langle /a\rangle\ \langle /a\rangle$

which corresponds to a tree of six nodes, all labeled “a”. If we traverse the document
from left to right and assign $i$ to the $i$-th opening tag that we encounter, we obtain

$\langle a\rangle_1\ \langle a\rangle_2\ \langle /a\rangle\ \langle a\rangle_3\ \langle a\rangle_4\ \langle /a\rangle\ \langle a\rangle_5\ \langle /a\rangle\ \langle /a\rangle\ \langle a\rangle_6\ \langle /a\rangle\ \langle /a\rangle$

For each $1 \le i \le 6$, let us assign node id $n_i$ to the node corresponding to the
opening tag with index $i$. Then, the document tree is as shown in Figure 1 (a) and

$n_1 \prec n_2 \prec n_3 \prec n_4 \prec n_5 \prec n_6$.

Over $\tau_{ur}$, $\prec$ can be defined by the caterpillar expression

$\mathrm{child}^+ \cup (\mathrm{child}^{-1})^*.\mathrm{nextsibling}^+.\mathrm{child}^*$,

where “child” is a shortcut for firstchild.nextsibling$^*$. This caterpillar expression
basically says that $n \prec n'$ iff $n'$ is a descendant of $n$ or $n'$ is in a subtree rooted by
a node that is a right sibling of a node on the path from $n$ to the root node. It is
not difficult to verify that this is a correct alternative definition of $\prec$.

By Proposition 2.3, child$^{-1}$ is also equivalent to (nextsibling$^{-1}$)$^*$.firstchild$^{-1}$. □
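
The caterpillar definition of document order can be checked mechanically on the six-node example. The Python sketch below interprets caterpillar operators directly on finite relations (sets of pairs); the helper names `compose`, `inverse`, `star`, and `plus` and the use of the integers 1 to 6 as node ids are assumptions of this illustration, and the quadratic closure loop is chosen for brevity rather than efficiency.

```python
from typing import Set, Tuple

Rel = Set[Tuple[int, int]]

# The six-node tree of the example: node 1 has children 2, 3, 6,
# and node 3 has children 4, 5.
DOM = {1, 2, 3, 4, 5, 6}
FIRSTCHILD: Rel = {(1, 2), (3, 4)}
NEXTSIBLING: Rel = {(2, 3), (3, 6), (4, 5)}


def compose(r: Rel, s: Rel) -> Rel:
    """Concatenation: pairs (x, z) with (x, y) in r and (y, z) in s."""
    return {(x, z) for (x, y) in r for (y2, z) in s if y == y2}


def inverse(r: Rel) -> Rel:
    """Inversion: swap the components of every pair."""
    return {(y, x) for (x, y) in r}


def star(r: Rel) -> Rel:
    """Reflexive and transitive closure over DOM."""
    closure = {(x, x) for x in DOM} | set(r)
    while True:
        bigger = closure | compose(closure, closure)
        if bigger == closure:
            return closure
        closure = bigger


def plus(r: Rel) -> Rel:
    """E+ as a shortcut for E.E*."""
    return compose(r, star(r))


# child := firstchild.nextsibling*
child = compose(FIRSTCHILD, star(NEXTSIBLING))

# child+  U  (child^-1)*.nextsibling+.child*
doc_order = plus(child) | compose(
    compose(star(inverse(child)), plus(NEXTSIBLING)), star(child)
)
```

On this tree, `doc_order` coincides with the numeric order of the node ids, and `inverse(child)` coincides with the relation obtained from (nextsibling$^{-1}$)$^*$.firstchild$^{-1}$, as the last sentence of the example states.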

3. MONADIC DATALOG 

In this section, we provide a formal background for the remainder of the paper. We 
define the language of monadic datalog and provide known - sometimes folklore - 
results regarding its expressiveness and complexity. 

3.1 Syntax and Semantics 

We briefly define the function-free logic programming syntax and semantics of
datalog (cf. [Abiteboul et al. 1995; Ceri et al. 1990] for detailed surveys of datalog).

A datalog program is a set of datalog rules. A datalog rule is of the form

$h \leftarrow b_1, \ldots, b_n.$

where $h, b_1, \ldots, b_n$ are called atoms, $h$ is called the rule head, and $b_1, \ldots, b_n$
(understood as a conjunction of atoms) is called the body. Each atom is of the form
$p(x_1, \ldots, x_m)$, where $p$ is a predicate and $x_1, \ldots, x_m$ are variables and constants
(from a finite domain dom). Variable-free atoms, rules, or programs are called
ground. Rules are required to be safe, i.e., all variables appearing in the head also
have to appear in the body. A body atom which contains all variables of its rule
is called a guard, and a rule containing such an atom is called guarded. Predicates
that appear in the head of some rule of a program are called intensional, all others
are called extensional. An extension is a set of ground atoms that are assumed to be
true. We assume that for each extensional predicate, a (possibly empty) extension
is given as input data. By signature, we denote the (finite) set of all extensional
predicates (with fixed arities) available to the program. By default, we use the
signatures $\tau_{rk}$ and $\tau_{ur}$ for ranked and unranked trees, respectively.$^6$

Let $r$ be a datalog rule. By $Vars(r)$ we denote the set of variables occurring in $r$
and by $Body(r)$ we denote the set of body atoms of $r$.

A valuation is a function $\phi : (Vars(r) \cup \mathrm{dom}) \to \mathrm{dom}$ which maps each variable
to an element of dom and is the identity on dom. Given an atom $p(x_1, \ldots, x_m)$, let

$\phi(p(x_1, \ldots, x_m)) := p(\phi(x_1), \ldots, \phi(x_m))$.

We define the semantics of datalog by means of the fixpoint operator $T_{\mathcal{P}}$.

Definition 3.1 (Immediate consequence operator). Let $\mathcal{P}$ be a datalog
program and $\mathcal{B}$ the (finite) set of all ground atoms over the domain dom and a
given signature. The immediate consequence operator $T_{\mathcal{P}} : 2^{\mathcal{B}} \to 2^{\mathcal{B}}$ is defined as

$T_{\mathcal{P}}(X) := X \cup \{\phi(h) \mid$ there is a rule $h \leftarrow b_1, \ldots, b_n.$ in $\mathcal{P}$ and a valuation $\phi$
on the rule s.t. $\phi(b_1), \ldots, \phi(b_n) \in X\}$.

Let $T_{\mathcal{P}}^0 := X$ and $T_{\mathcal{P}}^{i+1} := T_{\mathcal{P}}(T_{\mathcal{P}}^i)$ for each $i \ge 0$, where $X$ is the database given as
a set of ground atoms. The fixpoint $T_{\mathcal{P}}^i = T_{\mathcal{P}}^{i+1}$ of the sequence $T_{\mathcal{P}}^0, T_{\mathcal{P}}^1, T_{\mathcal{P}}^2, \ldots$ is
denoted by $T_{\mathcal{P}}^\omega$. □

It is clear that $T_{\mathcal{P}}$ eventually reaches a fixpoint because it ranges over a finite
universe dom given with the database and the sequence $T_{\mathcal{P}}^0, T_{\mathcal{P}}^1, T_{\mathcal{P}}^2, \ldots$ is strictly
(because $T_{\mathcal{P}}$ is deterministic) monotonically increasing until the fixpoint is reached.
The semantics of $\mathcal{P}$ on $X$ is defined as $T_{\mathcal{P}}^\omega$.

Monadic datalog is obtained from full datalog by requiring all intensional
predicates to be unary. By unary query, for monadic datalog as for MSO, we denote
a function that assigns a predicate to some elements of dom (or, in other words,
selects a subset of dom). For monadic datalog, one obtains a unary query by
distinguishing one intensional predicate as the query predicate. In the remainder of
this paper, when talking about a monadic datalog query, we will always refer to
a unary query specified as a monadic datalog program with a distinguished query
predicate.

Example 3.2. We construct a monadic datalog program over $\tau_{ur}$ which, given
an unranked tree, computes all those nodes which are roots of subtrees containing
an even number of nodes labeled “a”.

The program uses three pairs of intensional predicates, $B_i$, $C_i$, and $R_i$ ($i \in \{0,1\}$).
$B_i(n)$ states that $i$ is the number of nodes (modulo 2) labeled “a” in the subtree of $n$
excluding $n$ itself, $C_i(n)$ that $i$ is the count (mod 2) of such nodes in the subtree of $n$ (thus,
including $n$), and $R_i(n)$ that $i$ is the sum (mod 2) of the occurrences of “a” in the
subtrees of nodes in the ordered list of siblings of $n$ from the right up to $n$.

6 Note that our tree structures contain some redundancy (e.g., a leaf is a node $x$ such that
$\lnot(\exists y)\,\mathrm{firstchild}(x, y)$), by which (monadic) datalog becomes as expressive as its semipositive
generalization. Semipositive datalog allows the complements of extensional relations to be used in rule
bodies.
The program consists of the rules

$B_0(x) \leftarrow \mathrm{leaf}(x).$    (1)

$B_i(x_0) \leftarrow \mathrm{firstchild}(x_0, x),\ R_i(x).$    (2)

$C_{(i+1) \bmod 2}(x) \leftarrow B_i(x),\ \mathrm{label}_a(x).$    (3)

$C_i(x) \leftarrow B_i(x),\ \mathrm{label}_l(x).$    (4)

$R_i(x) \leftarrow \mathrm{lastsibling}(x),\ C_i(x).$    (5)

$R_{(i+j) \bmod 2}(x_0) \leftarrow C_j(x_0),\ \mathrm{nextsibling}(x_0, x),\ R_i(x).$    (6)

for each $i, j \in \{0,1\}$, $l \in (\Sigma - \{a\})$. The query predicate is $C_0$ (“even”).

Now consider a 4-node tree (dom $= \{n_1, n_2, n_3, n_4\}$) consisting of a root node $n_1$
and three children (from left to right) $n_2$, $n_3$, and $n_4$. All nodes are labeled “a”. In
the tree structure, we have root $= \{n_1\}$, leaf $= \{n_2, n_3, n_4\}$, firstchild $= \{(n_1, n_2)\}$,
nextsibling $= \{(n_2, n_3), (n_3, n_4)\}$, lastsibling $= \{n_4\}$, and label$_a$ $=$ dom.

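The computation of the fixpoint $T_{\mathcal{P}}^\omega$ for the program given above proceeds as follows.
Derived atoms are annotated with the rules that entail them (as superscripts).

$T_{\mathcal{P}}^0 = \{\mathrm{root}(n_1), \mathrm{leaf}(n_2), \mathrm{leaf}(n_3), \mathrm{leaf}(n_4),$
$\mathrm{firstchild}(n_1, n_2), \mathrm{nextsibling}(n_2, n_3), \mathrm{nextsibling}(n_3, n_4),$
$\mathrm{lastsibling}(n_4), \mathrm{label}_a(n_1), \ldots, \mathrm{label}_a(n_4)\}$

$T_{\mathcal{P}}^1 = T_{\mathcal{P}}^0 \cup \{B_0(n_2)^{(1)}, B_0(n_3)^{(1)}, B_0(n_4)^{(1)}\}$

$T_{\mathcal{P}}^2 = T_{\mathcal{P}}^1 \cup \{C_1(n_2)^{(3)}, C_1(n_3)^{(3)}, C_1(n_4)^{(3)}\}$

$T_{\mathcal{P}}^3 = T_{\mathcal{P}}^2 \cup \{R_1(n_4)^{(5)}\}$

$T_{\mathcal{P}}^4 = T_{\mathcal{P}}^3 \cup \{R_0(n_3)^{(6)}\}$

$T_{\mathcal{P}}^5 = T_{\mathcal{P}}^4 \cup \{R_1(n_2)^{(6)}\}$

$T_{\mathcal{P}}^6 = T_{\mathcal{P}}^5 \cup \{B_1(n_1)^{(2)}\}$

$T_{\mathcal{P}}^7 = T_{\mathcal{P}}^6 \cup \{C_0(n_1)^{(3)}\}$

Now, $T_{\mathcal{P}}^7 = T_{\mathcal{P}}^8 = T_{\mathcal{P}}^\omega$. The query $Q = \{x \mid C_0(x) \in T_{\mathcal{P}}^\omega\}$ evaluates to $\{n_1\}$. □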
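
The fixpoint computation of this example can be replayed with a naive implementation of the immediate consequence operator. The Python sketch below is an illustration only: the tuple encoding of atoms, the `?` prefix for variables, and the function names are assumptions of ours, and the brute-force enumeration of valuations makes no attempt at the linear-time bounds discussed later.

```python
from itertools import product

# Atoms are tuples (predicate, term, ...); variables start with "?".


def rules_even_a(other_labels=()):
    """Rules (1)-(6) of the example, instantiated for i, j in {0, 1}."""
    rules = [(("B0", "?x"), [("leaf", "?x")])]                          # (1)
    for i in (0, 1):
        rules.append(((f"B{i}", "?x0"),
                      [("firstchild", "?x0", "?x"), (f"R{i}", "?x")]))  # (2)
        rules.append(((f"C{(i + 1) % 2}", "?x"),
                      [(f"B{i}", "?x"), ("label_a", "?x")]))            # (3)
        for l in other_labels:
            rules.append(((f"C{i}", "?x"),
                          [(f"B{i}", "?x"), (f"label_{l}", "?x")]))     # (4)
        rules.append(((f"R{i}", "?x"),
                      [("lastsibling", "?x"), (f"C{i}", "?x")]))        # (5)
        for j in (0, 1):
            rules.append(((f"R{(i + j) % 2}", "?x0"),
                          [(f"C{j}", "?x0"), ("nextsibling", "?x0", "?x"),
                           (f"R{i}", "?x")]))                           # (6)
    return rules


def substitute(atom, valuation):
    """Apply a valuation to an atom; constants are mapped to themselves."""
    return (atom[0],) + tuple(valuation.get(t, t) for t in atom[1:])


def immediate_consequence(rules, facts, dom):
    """One application of T_P: the facts plus every head derivable in one step."""
    derived = set(facts)
    for head, body in rules:
        variables = sorted({t for atom in body for t in atom[1:]
                            if t.startswith("?")})
        for values in product(dom, repeat=len(variables)):
            valuation = dict(zip(variables, values))
            if all(substitute(b, valuation) in facts for b in body):
                derived.add(substitute(head, valuation))
    return derived


def fixpoint(rules, facts, dom):
    """Return the list of stages T^0, T^1, ... up to the fixpoint."""
    stages = [set(facts)]
    while True:
        nxt = immediate_consequence(rules, stages[-1], dom)
        if nxt == stages[-1]:
            return stages
        stages.append(nxt)


DOM = ["n1", "n2", "n3", "n4"]
DATABASE = {
    ("root", "n1"), ("leaf", "n2"), ("leaf", "n3"), ("leaf", "n4"),
    ("firstchild", "n1", "n2"),
    ("nextsibling", "n2", "n3"), ("nextsibling", "n3", "n4"),
    ("lastsibling", "n4"),
} | {("label_a", n) for n in DOM}

stages = fixpoint(rules_even_a(), DATABASE, DOM)
even_roots = {atom[1] for atom in stages[-1] if atom[0] == "C0"}
```

Here `stages[k]` corresponds to the $k$-th stage of the computation above, so differences between consecutive stages can be compared against the annotated derivation.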

3.2 Expressiveness and Evaluation Complexity 

The following result is part of the database folklore: 

Proposition 3.3. Over arbitrary finite structures, each monadic datalog query
is $\Pi_1$-MSO-definable.

Proof. Let $\mathcal{P}$ be a monadic datalog program and w.l.o.g. let $P_1$ be the query
predicate. We encode the query defined by $\mathcal{P}$ as

$\varphi(x) := (\forall P_1) \cdots (\forall P_n)\, [\mathrm{SAT}(P_1, \ldots, P_n) \to x \in P_1]$

where $\{P_1, \ldots, P_n\}$ is the set of all intensional predicates appearing in $\mathcal{P}$ and
$\mathrm{SAT}(P_1, \ldots, P_n)$ is the conjunction of the logical formulae corresponding to the
rules of $\mathcal{P}$ in the following way. Given rule $h \leftarrow b_1, \ldots, b_m.$, its formula is

$(\forall z_1) \cdots (\forall z_k)\, (b_1 \land \cdots \land b_m \to h)$,

where $z_1, \ldots, z_k$ consists of all variables appearing in the rule and an atom $P_i(x)$ is
understood as $x \in P_i$.

This can easily be justified by the fact that the minimal model $T_{\mathcal{P}}^\omega$ is the
intersection of all models of $\mathcal{P}$, and an interpretation $P_1, \ldots, P_n$ is a model of $\mathcal{P}$ iff
$\mathrm{SAT}(P_1, \ldots, P_n)$ is true. □

Throughout the paper, our main measure of query evaluation cost is combined 
complexity, i.e. where both the database and the query (or program) are considered 
variable. 

Proposition 3.4. Monadic datalog (over arbitrary finite structures) is NP-complete
w.r.t. combined complexity.

Proof. Since all intensional predicates are unary, a proof (tree) can be guessed 
and subsequently verified in polynomial time, and NP-hardness follows from the 
NP-completeness of boolean conjunctive queries (and thus single-rule programs). □ 

We discuss a number of fragments that can be evaluated efficiently. 

Proposition 3.5. Given a ground datalog program $\mathcal{P}$ and a structure $\sigma$, $\mathcal{P}$ can
be evaluated on $\sigma$ in time $O(|\mathcal{P}| + |\sigma|)$.

Proof. By adding the facts from “database” $\sigma$ to the variable-free (and thus
propositional) program $\mathcal{P}$, we obtain an instance of propositional Horn-SAT, which
can be solved in linear time [Dowling and Gallier 1984; Minoux 1988].⁷ □
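
To make the linear-time bound concrete, the following Python sketch evaluates a ground
program by counter-based unit propagation for Horn clauses. It is an illustration in the
spirit of the cited algorithms, not a transcription of either; the function name
`horn_fixpoint` and the encoding of rules as (head, body) pairs are assumptions made for
this example.

```python
from collections import defaultdict, deque


def horn_fixpoint(rules, facts):
    """Least model of a propositional Horn program.

    rules: list of (head, body) pairs, where body is a list of atoms.
    facts: iterable of atoms that hold initially (the "database").
    """
    remaining = []                # per rule: distinct body atoms not yet derived
    watchers = defaultdict(list)  # atom -> indices of rules with it in the body
    true_atoms = set()
    queue = deque()

    def derive(atom):
        # Every atom enters the queue at most once.
        if atom not in true_atoms:
            true_atoms.add(atom)
            queue.append(atom)

    for atom in facts:
        derive(atom)
    for idx, (head, body) in enumerate(rules):
        distinct = set(body)
        remaining.append(len(distinct))
        for atom in distinct:
            watchers[atom].append(idx)
        if not distinct:          # a rule with empty body is a fact
            derive(head)

    while queue:
        atom = queue.popleft()
        for idx in watchers[atom]:
            remaining[idx] -= 1
            if remaining[idx] == 0:   # whole body derived: fire the rule
                derive(rules[idx][0])
    return true_atoms


if __name__ == "__main__":
    program = [("p", ["a", "b"]), ("q", ["p"]), ("r", ["q", "c"])]
    print(sorted(horn_fixpoint(program, ["a", "b"])))
```

Each derived atom is dequeued once and each body occurrence is decremented once, so the
work is proportional to the total size of the rules and facts.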

Proposition 3.6. Let $\mathcal{P}$ be a datalog program in which each rule is guarded by
an extensional atom. Then, $\mathcal{P}$ can be evaluated on structure $\sigma$ in time $O(|\mathcal{P}| * |\sigma|)$.

Proof. For each rule r with guard $R(x_1, \ldots, x_k)$, we proceed as follows. For
each tuple $\langle c_1, \ldots, c_k \rangle \in R^{\sigma}$, we generate a ground version of r by replacing each
occurrence of variable $x_i$ in r by $c_i$. Only $O(|R^{\sigma}|)$ such rules are created for each r.

The ground program obtained that way is of size $O(|\mathcal{P}| * |\sigma|)$, can be computed
within the same time bounds, and is equivalent to $\mathcal{P}$. We apply Proposition 3.5 to
complete the evaluation of $\mathcal{P}$. ($O(|\mathcal{P}| * |\sigma| + |\sigma|) = O(|\mathcal{P}| * |\sigma|)$.) □
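
The grounding step of this proof can be sketched in Python as follows. The representation
is an assumption made for illustration: an atom is a pair (predicate, tuple of variables),
a rule is a triple (head, guard, other body atoms), and the structure is a dictionary from
extensional predicate names to sets of tuples.

```python
def ground_guarded(rules, structure):
    """Ground every guarded rule once per tuple of its guard relation.

    rules: list of (head, guard, rest); each atom is (pred, (var, ...)).
    structure: dict mapping extensional predicate names to sets of tuples.
    Returns a list of ground rules (head, body) over constants.
    """
    ground = []
    for head, guard, rest in rules:
        guard_pred, guard_vars = guard
        for tup in structure.get(guard_pred, ()):
            binding = {}
            consistent = True
            for var, const in zip(guard_vars, tup):
                # A variable repeated in the guard must match one constant.
                if binding.setdefault(var, const) != const:
                    consistent = False
                    break
            if not consistent:
                continue

            def inst(atom):
                pred, args = atom
                return (pred, tuple(binding[v] for v in args))

            ground.append((inst(head), [inst(guard)] + [inst(a) for a in rest]))
    return ground


if __name__ == "__main__":
    rule = (("reach", ("y",)), ("edge", ("x", "y")), [("reach", ("x",))])
    for ground_rule in ground_guarded([rule], {"edge": {(1, 2), (2, 3)}}):
        print(ground_rule)
```

Because every variable of a rule occurs in its guard, one guard tuple fixes the whole
instantiation, which is why only $O(|R^{\sigma}|)$ ground rules arise per rule.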

Let Datalog LIT [Gottlob et al. 2002] be the fragment of datalog in which the 
body of each rule either (i) consists exclusively of monadic atoms or (ii) contains 
one atom, the guard, in which all variables of the rule occur. Monadic Datalog LIT 
is the fragment of Datalog LIT in which all head atoms are unary. 

Proposition 3.7 [Gottlob et al. 2002]. Given a monadic Datalog LIT program
$\mathcal{P}$ and a finite structure $\sigma$, $\mathcal{P}$ can be evaluated in time $O(|\mathcal{P}| * |\sigma|)$.

As already propositional Horn-SAT is P-complete (e.g., [Papadimitriou 1994]), all 
of the above problems (with the program considered variable) are actually P-hard. 

4. EXPRESSIVENESS AND COMPLEXITY OF MONADIC DATALOG ON TREES 

This section is divided into three parts. First, we characterize the complexity of 
evaluating a program on a tree; second, we show that monadic datalog on trees 
captures the unary MSO queries, and third, we study the relationship between 
monadic datalog and query automata and prove a new result on the complexity of 
the query containment problem for monadic datalog on trees. 

⁷ An earlier linear-time algorithm for the equivalent implication problem for functional
dependencies can be found in [Beeri and Bernstein 1979].

4.1 Evaluation Complexity

We start by characterizing the complexity of evaluating monadic datalog programs
over trees. We first need to introduce the standard notion of a functional
dependency. Let R be a relation. By $\$i$, we denote the i-th column of R. A functional
dependency $R: \$i \rightarrow \$j$ means that R satisfies the constraint that whenever
$\langle a_1, \ldots, a_k \rangle, \langle b_1, \ldots, b_k \rangle \in R$ such that $a_i = b_i$, the values $a_j$ and $b_j$ must be equal
as well. Observe that by definition,

Proposition 4.1. Each binary predicate⁸ R in $\tau_{rk}$ or $\tau_{ur}$ has both a functional
dependency $R: \$1 \rightarrow \$2$ and a functional dependency $R: \$2 \rightarrow \$1$.

For instance, each node has at most one first child and is the first child of at 
most one other node. 

Theorem 4.2. Over $\tau_{rk}$ as well as $\tau_{ur}$, monadic datalog has $O(|\mathcal{P}| * |dom|)$ combined
complexity (where $|\mathcal{P}|$ is the size of the program and $|dom|$ the size of the tree).

Proof. We will call a rule r connected if and only if the (undirected) graph
$G_r = (V, E)$ with $V = Vars(r)$ and $E = \{\{x, y\} \mid R(x, y) \in Body(r)\}$ is connected.

We proceed in three steps. First, we translate $\mathcal{P}$ into a program $\mathcal{P}'$ in which
each rule is connected. For each rule $r \in \mathcal{P}$, in case $G_r$ is not connected, we split
off each connected component C of $G_r$ that does not contain the variable in the
head of r, create a rule r' with a propositional head predicate p and $Body(r') = C$,
and replace C in r by p. For instance, the rule $p(x) \leftarrow p_1(x), p_2(y).$, which is not
connected, is rewritten into the two rules $p(x) \leftarrow p_1(x), b.$ and $b \leftarrow p_2(y).$ Here, b is a
new propositional predicate. We obtain a set of connected rules $\mathcal{P}'$ in linear time.

Second, we compute a “ground” program $\mathcal{P}''$ from $\mathcal{P}'$ which consists, for each
rule r of $\mathcal{P}'$, of all ground rules obtainable by instantiating the variables in r with
nodes from dom. By Proposition 4.1, the connectedness of $G_r$ ensures that each
variable of r functionally determines all others. There are only $O(|dom|)$ relevant
variable-free ground instantiations of r, which can be computed in time $O(|dom|)$.

Finally, by Proposition 3.5, the fixpoint of the ground program $\mathcal{P}''$ can be computed
in time linear in its size, and it is equivalent to the fixpoint of $\mathcal{P}$ on the input tree minus the
propositional atoms (e.g., b in our example above) added in the first step. Thus,
the three steps in total require $O(|\mathcal{P}| * |dom|)$ time. □
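
The second step of this proof relies on the fact that, in a connected rule, fixing one
variable fixes all others. The Python sketch below illustrates this for a rule body over
binary tree relations. The representation is an assumption made for the example: each
relation is a dictionary mapping a node to its unique successor (possible by
Proposition 4.1), and a binary atom is a triple (relation, x, y). The sketch repeatedly
scans the atoms for simplicity; a graph traversal of $G_r$ would avoid the rescanning.

```python
def ground_connected_rule(variables, binary_atoms, relations, dom):
    """Enumerate the instantiations of a connected rule, at most one per node.

    variables: list of the rule's variables (the first one is the seed).
    binary_atoms: list of (rel, x, y) body atoms.
    relations: dict rel -> dict mapping a node to its unique rel-successor.
    dom: iterable of tree nodes (nodes must not be None).
    """
    inverse = {r: {b: a for a, b in f.items()} for r, f in relations.items()}
    instantiations = []
    for node in dom:
        binding = {variables[0]: node}
        ok, progress = True, True
        while ok and progress:
            progress = False
            for rel, x, y in binary_atoms:
                if x in binding and y not in binding:
                    image = relations[rel].get(binding[x])
                    if image is None:       # no successor: no instantiation
                        ok = False
                        break
                    binding[y] = image
                    progress = True
                elif y in binding and x not in binding:
                    preimage = inverse[rel].get(binding[y])
                    if preimage is None:    # no predecessor: no instantiation
                        ok = False
                        break
                    binding[x] = preimage
                    progress = True
        # Re-check all atoms: needed when the rule graph contains a cycle.
        if ok and len(binding) == len(variables) and all(
            relations[rel].get(binding[x]) == binding[y]
            for rel, x, y in binary_atoms
        ):
            instantiations.append(binding)
    return instantiations


if __name__ == "__main__":
    # Root n1 with children n2, n3, n4 (in this order).
    relations = {"firstchild": {"n1": "n2"},
                 "nextsibling": {"n2": "n3", "n3": "n4"}}
    body = [("firstchild", "x", "y"), ("nextsibling", "y", "z")]
    print(ground_connected_rule(["x", "y", "z"], body, relations,
                                ["n1", "n2", "n3", "n4"]))
```

Seeding with a different variable of the rule yields the same set of instantiations,
since the relations are functional in both directions.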

Therefore, we have both linear time data and program complexities. 

Remark 4.3. The data complexity part of Theorem 4.2 also follows from the
fact that the data complexity of MSO queries over finite structures of bounded
tree-width is in linear time [Flum et al. 2001] and the fact that ranked and unranked
trees over a fixed labeling alphabet are of bounded tree-width. □

4.2 Expressiveness of Unary Queries 

In this section, we show that a unary query over ranked or unranked trees is
MSO-definable exactly if it is definable in monadic datalog. All that needs to be shown
is that each unary query in MSO (over trees) can be expressed in monadic datalog,
as the other direction follows from Proposition 3.3.

⁸ That is, one of $(\mathrm{child}_k)_{k \le K}$, firstchild, and nextsibling.

Theorem 4.4. Each unary MSO-definable query over $\tau_{rk}$ (resp., $\tau_{ur}$) is also
definable in monadic datalog over $\tau_{rk}$ (resp., $\tau_{ur}$).


Given a tree t and a node $v \in dom_t$, let $t_v$ denote the subtree of t rooted by v
and $\bar{t}_v$ the envelope (or complement) of $t_v$ in t, which is obtained by removing all
of $t_v$ in t except for node v itself (that is, $t_v$ and $\bar{t}_v$ share exactly v). Given a tree
$S_1$ that contains w as a leaf and a tree $S_2$, let $S_1[w \rightarrow S_2]$ be the tree obtained by
the fusion of w and the root node of $S_2$. Notably, $\bar{t}_v[v \rightarrow t_v]$ denotes the insertion
of $t_v$ into $\bar{t}_v$ at node v, which again amounts to t.

In the following, let structures $\sigma$ with one distinguished constant c be denoted
as $(\sigma, c)$. By $(\sigma_1, c_1) \equiv_k^{MSO} (\sigma_2, c_2)$, we denote that for all MSO sentences $\varphi$ of
quantifier rank k, $(\sigma_1, c_1) \models \varphi$ if and only if $(\sigma_2, c_2) \models \varphi$. (Thus, $(\sigma_1, c_1)$ and $(\sigma_2, c_2)$
are indistinguishable by MSO sentences of quantifier rank k.) Clearly, $\equiv_k^{MSO}$ is an
equivalence relation. We also call its equivalence classes the $\equiv_k^{MSO}$-types.

Proposition 4.5. Given a natural number k,

(1) there is only a finite number of equivalence classes of $\equiv_k^{MSO}$, and

(2) there is an effective procedure for deciding whether $(\sigma_1, c_1) \equiv_k^{MSO} (\sigma_2, c_2)$.

Such a decision procedure is provided by Ehrenfeucht-Fraïssé games, which exactly
capture the essence of quantification in MSO over finite structures. Given the
following proposition, we do not need to ponder about them in detail, but refer to
[Ebbinghaus and Flum 1999] for a detailed account of their theory.


Proposition 4.6 (Folklore, cf. [Neven and Schwentick 2002]). Let
t and s be trees with nodes $v \in dom_t$ and $w \in dom_s$, both with n children ($n \ge 0$).
Let $v_i$ and $w_i$ ($1 \le i \le n$) be the i-th child (from the left) of v and w, respectively.

(1) If $(t_{v_i}, v_i) \equiv_k^{MSO} (s_{w_i}, w_i)$ for all $1 \le i \le n$ and $label_t(v) = label_s(w)$ then
$(t_v, v) \equiv_k^{MSO} (s_w, w)$.

(2) Let $i \in \{1, \ldots, n\}$. If $(\bar{t}_v, v) \equiv_k^{MSO} (\bar{s}_w, w)$, $label_t(v_i) = label_s(w_i)$, and
$(t_{v_j}, v_j) \equiv_k^{MSO} (s_{w_j}, w_j)$ for $j \in \{1, \ldots, n\} - \{i\}$, then $(\bar{t}_{v_i}, v_i) \equiv_k^{MSO} (\bar{s}_{w_i}, w_i)$.

(3) If $(\bar{t}_v, v) \equiv_k^{MSO} (\bar{s}_w, w)$ and $(t_v, v) \equiv_k^{MSO} (s_w, w)$ then $(t, v) \equiv_k^{MSO} (s, w)$.

Now we are ready to show the main results of this section. 


Proof of Theorem 4.4. We first consider the ranked tree case.

Given an MSO formula $\varphi$ of quantifier rank k with one free first-order variable,
we compute a monadic datalog program with a distinguished query predicate $\varphi$
which defines the same (unary) query. The main idea of the proof is that we can
compute the (relevant) $\equiv_k^{MSO}$-types for $\varphi$ together with a witness structure for each
type (equivalence class) and decide already when computing the program for which
witness structures (t, v) and thus $\equiv_k^{MSO}$-types it holds that $(t, v) \models \varphi$. Computing
$\equiv_k^{MSO}$-types for nodes v of a given data tree t, and thus deciding $(t, v) \models \varphi$, is easy
enough to be carried out by a monadic datalog program.

In the following, the arrows ↑ and ↓ are meant to support the intuition that the
$\equiv_k^{MSO}$-types of subtrees $t_v$ and their envelopes $\bar{t}_v$ are mainly computed bottom-up
and top-down, respectively.

We maintain two sets of types $\Theta_k^{\uparrow}$ and $\Theta_k^{\downarrow}$, representing $\equiv_k^{MSO}$-types computed for
subtrees $t_v$ of nodes v in trees t (denoted $T_k^{MSO,\uparrow}(t_v, v)$) and for their counterparts
($T_k^{MSO,\downarrow}(\bar{t}_v, v)$), respectively. Moreover, we maintain a witness $W(\theta)$ of each type
$\theta$, i.e. structures $W(T_k^{MSO,\uparrow}(t_v, v))$ and $W(T_k^{MSO,\downarrow}(\bar{t}_v, v))$ such that
$(t_v, v) \equiv_k^{MSO} W(T_k^{MSO,\uparrow}(t_v, v))$ and $(\bar{t}_v, v) \equiv_k^{MSO} W(T_k^{MSO,\downarrow}(\bar{t}_v, v))$.
The types in $\Theta_k^{\uparrow}$ and $\Theta_k^{\downarrow}$
will serve as predicate names in the monadic datalog program to be constructed.

Given a structure (t, v), we compute its type $T_k^{MSO,\uparrow}(t, v)$ (or $T_k^{MSO,\downarrow}(t, v)$) by
trying for each $\theta \in \Theta_k^{\uparrow}$ (or $\Theta_k^{\downarrow}$) whether $(t, v) \equiv_k^{MSO} W(\theta)$. By Proposition 4.5,
we have an effective procedure for deciding this. If such a $\theta$ exists, it is returned.
Otherwise, we invent any new token $\theta$, add it to $\Theta_k^{\uparrow}$ (or $\Theta_k^{\downarrow}$), set $W(\theta) := (t, v)$,
and return $\theta$.

It is convenient to compute both the sets $\Theta_k^{\uparrow}$ and $\Theta_k^{\downarrow}$ and the monadic datalog
program $\mathcal{P}$ as parts of the same construction, which consists of three parts,
analogously to the three parts of Proposition 4.6. Initially, $\mathcal{P} = \Theta_k^{\uparrow} = \Theta_k^{\downarrow} = \emptyset$.

(1) For $0 \le n \le K$ (where K is the maximum rank of the trees), for each
combination of n elements $\theta_1, \ldots, \theta_n$ of $\Theta_k^{\uparrow}$, and for each $l \in \Sigma$, let t be the tree
constructed from a new root node v labeled l and $W(\theta_1), \ldots, W(\theta_n)$ as children.
We set $\theta := T_k^{MSO,\uparrow}(t, v)$. (Now, $\theta \in \Theta_k^{\uparrow}$ and $W(\theta) \equiv_k^{MSO} (t, v)$.) Moreover,
if n = 0, we add the rule

$\theta(x) \leftarrow \mathrm{leaf}(x), \mathrm{label}_l(x).$

to $\mathcal{P}$; otherwise, we add

$\theta(x) \leftarrow \mathrm{child}_1(x, x_1), \theta_1(x_1), \ldots, \mathrm{child}_n(x, x_n), \theta_n(x_n), \mathrm{label}_l(x).$

This is repeated until no new $\equiv_k^{MSO}$-types $\theta$ can be added to $\Theta_k^{\uparrow}$. Termination
is guaranteed as there are only finitely many $\equiv_k^{MSO}$-types and labels in $\Sigma$.

(2) To compute $\Theta_k^{\downarrow}$, we start at the root node. For each $l \in \Sigma$, let $\bar{t}_{root_t}$ be a tree
that consists simply of a (root) node labeled l and let $\theta := T_k^{MSO,\downarrow}(\bar{t}_{root_t}, root_t)$.
We add the rule

$\theta(x) \leftarrow \mathrm{root}(x), \mathrm{label}_l(x).$

to $\mathcal{P}$. For nodes $v_i$ other than the root node, the $\equiv_k^{MSO}$-type of $(\bar{t}_{v_i}, v_i)$ depends
also on the $\equiv_k^{MSO}$-types of the siblings. For all $1 \le i \le n \le K$, all $\theta_1, \ldots, \theta_n \in \Theta_k^{\uparrow}$
s.t. $W(\theta_j) = (t_j, v_j)$ for all $1 \le j \le n$, and $\theta \in \Theta_k^{\downarrow}$ with $W(\theta) = (\bar{t}_v, v)$, let
$\bar{t}_{v_i}$ be the tree obtained by appending the list of trees $t_1, \ldots, t_{i-1}, v_i, t_{i+1}, \ldots, t_n$
to the leaf node v of $\bar{t}_v$. Let $\theta_i' := T_k^{MSO,\downarrow}(\bar{t}_{v_i}, v_i)$. We add the rule

$\theta_i'(x_i) \leftarrow \theta(x), \mathrm{child}_i(x, x_i), \mathrm{label}_l(x_i), \bigwedge_{1 \le j \le n,\ j \ne i} (\mathrm{child}_j(x, x_j), \theta_j(x_j)).$

to $\mathcal{P}$. Types and the witness structures are maintained as for $\Theta_k^{\uparrow}$, and
termination is guaranteed.

(3) For each $\theta_1 \in \Theta_k^{\uparrow}$ and each $\theta_2 \in \Theta_k^{\downarrow}$ such that $W(\theta_1) = (t_1, v_1)$, where $v_1$ is the
root of $t_1$, and $W(\theta_2) = (t_2, v_2)$, where $v_2$ is a leaf of $t_2$, we proceed as follows.
If $(t_2[v_2 \rightarrow t_1], v_1) \models \varphi$, we add the rule

$\varphi(x) \leftarrow \theta_1(x), \theta_2(x).$

to $\mathcal{P}$.

The sets of predicates used in the left-hand sides of rules added to $\mathcal{P}$ in the three
parts of our construction, $\Theta_k^{\uparrow}$, $\Theta_k^{\downarrow}$, and $\{\varphi\}$, are disjoint, so we can consider the
subprograms of $\mathcal{P}$ defined in each of the three parts individually, assuming in each
case that the fixpoint for the rules from previous parts is available as input.

In part (1) of our construction, $\Theta_k^{\uparrow}$ is computed following Proposition 4.6 (1)
and a bottom-up intuition. We add types to $\Theta_k^{\uparrow}$ as long as we can construct
structures of new types by combining the witness structures of existing types using
labels from $\Sigma$. The monadic datalog rules defined there are a direct realization of
Proposition 4.6 (1). It is easy to see that the rules of (1) in isolation compute an
atom $\theta(v)$ on a tree t exactly if $\theta = T_k^{MSO,\uparrow}(t_v, v)$.

In part (2), we compute $\Theta_k^{\downarrow}$ using Proposition 4.6 (2) and a top-down intuition.
Given an input tree t, the monadic datalog rules added in part (2) compute $\theta(v)$
for each node v and the one $\theta \in \Theta_k^{\downarrow}$ such that $\theta = T_k^{MSO,\downarrow}(\bar{t}_v, v)$.

In part (3) of our construction, we use Proposition 4.6 (3) to combine the types
computed for each node to answer the query $\varphi$. Here, for types $\theta_1$ and $\theta_2$ with
$W(\theta_1) \equiv_k^{MSO} (t_v, v)$ and $W(\theta_2) \equiv_k^{MSO} (\bar{t}_v, v)$, we do not need to explicitly compute
the combined $\equiv_k^{MSO}$-type of (t, v). By our construction, if $\theta_1(v)$ and $\theta_2(v)$ evaluate
to true for $\mathcal{P}$ and the program contains the rule $\varphi(x) \leftarrow \theta_1(x), \theta_2(x)$, we know that
$\varphi$ holds for the combined type and that v has to be part of the query result.

This concludes our proof for the ranked tree case. The unranked tree case (with
structures over $\tau_{ur}$) can be reduced to the former as follows.

A binary tree (over $\tau_{rk}$ and with maximum rank K = 2) is obtained from an
arbitrary unranked tree by the renaming of “firstchild” in $\tau_{ur}$ to “child_1” and
“nextsibling” to “child_2” (cf. Figure 1). The same renaming of relation names can be
applied to a query $\varphi$ on unranked trees. If we leave aside ranked alphabets, the
unranked tree case is thus equivalent to the ranked tree case $\tau_{rk}$. Since we did not
rely on the labels being ranked in the proof for the ranked tree case above (nor did
the original proofs of Proposition 4.6), we are done. □

By this result, it is also easy to see that a tree language (for ranked as well as
unranked trees) is regular iff it is definable in monadic datalog, given an appropriate
notion of acceptance of an input tree. We say that a monadic datalog program $\mathcal{P}$
with a query predicate “accept” accepts a tree t iff $\mathrm{accept}(root_t) \in T_{\mathcal{P}}^{\omega}$ (i.e., the root
node of t is in the inferred extension of “accept”). $\mathcal{P}$ recognizes the tree language
$L = \{t \mid \mathcal{P} \text{ accepts } t\}$.

Corollary 4.7. A tree language is definable in monadic datalog exactly if it is 
definable in MSO. 

This is similar to the folklore result that monadic fixpoint logic over trees captures 
MSO (with respect to tree language acceptance). 

4.3 Simulating Query Automata in Monadic Datalog 

As pointed out earlier, there is a need for formalisms that capture the expressive
power of unary MSO queries selecting nodes from trees. Clearly, MSO itself is by
far too expensive to be used in practice. Another previous formalism to achieve
this task is that of query automata [Neven and Schwentick 2002]. As we show,
while query automata are much more complicated, each query automaton can be
translated into an equivalent monadic datalog program. Our reduction is very
efficient, and can be carried out in logarithmic space. Based on this fact, we can
show that the containment problem for monadic datalog over ranked or unranked
trees (represented by $\tau_{rk}$ or $\tau_{ur}$) is EXPTIME-hard. This strengthens an earlier
result from [Cosmadakis et al. 1988] that the containment problem for monadic
datalog on arbitrary finite structures is EXPTIME-hard.

Definition 4.8 [Neven and Schwentick 2002]. A ranked query automaton
($QA^r$) - that is, a two-way deterministic ranked tree automaton with a selection
function - is a tuple

$A = (Q, \Sigma, F, s, \delta_\uparrow, \delta_\downarrow, \delta_{root}, \delta_{leaf}, \lambda)$,

where Q is a finite set of states, $F \subseteq Q$ is the (nonempty) set of final states, $s \in Q$
is the start state, $\Sigma$ is a ranked alphabet, the $\delta$'s are transition functions, and
$\lambda : Q \times \Sigma \rightarrow \{\bot, 1\}$ is the so-called selection function. Let there be a partition of
$Q \times \Sigma$ into two disjoint sets U and D.

(1) $\delta_\uparrow : U^{\le K} \rightarrow Q$ is the transition function for up transitions.

(2) $\delta_\downarrow : D \times \{1, \ldots, K\} \rightarrow Q^*$ is the transition function for down transitions. For
each $i \le K$, $\delta_\downarrow(q, a, i)$ is a string of states of length i.

(3) $\delta_{root} : U \rightarrow Q$ is the transition function for root transitions.

(4) $\delta_{leaf} : D \rightarrow Q$ is the transition function for leaf transitions.

Let t be a ranked tree. A cut is a subset of $dom_t$ which contains exactly one
node of each path from the root to a leaf. A configuration of A on t is a mapping
$c : C \rightarrow Q$ from a cut C of t to the set of states Q of A.

The automaton A makes a transition between two configurations $c_1 : C_1 \rightarrow Q$
and $c_2 : C_2 \rightarrow Q$, denoted by $c_1 \rightarrow c_2$, if it makes an up, down, root, or leaf
transition:

(1) A makes an up transition from $c_1$ to $c_2$ if there is a node n such that (a) the
children of n, say, $n_1, \ldots, n_m$, are in $C_1$, (b) $C_2 = (C_1 - \{n_1, \ldots, n_m\}) \cup \{n\}$, (c)
$c_2(n) = \delta_\uparrow((c_1(n_1), label(n_1)), \ldots, (c_1(n_m), label(n_m)))$, and (d) $c_2$ is identical
to $c_1$ on $C_1 \cap C_2$.

(2) A makes a down transition from $c_1$ to $c_2$ if there is a node n s.t. (a) $n \in C_1$,
(b) $C_2 = (C_1 - \{n\}) \cup \{n_1, \ldots, n_m\}$, where $\{n_1, \ldots, n_m\}$ is the set of children
of n, (c) $c_2(n_1) \cdots c_2(n_m) = \delta_\downarrow(c_1(n), label(n), arity(n))$, and (d) $c_2$ is identical
to $c_1$ on $C_1 \cap C_2$.

(3) A makes a root transition from $c_1$ to $c_2$ if (a) $C_1 = C_2 = \{root_t\}$, where $root_t$
denotes the root node of t, and (b) $c_2(root_t) = \delta_{root}(c_1(root_t), label(root_t))$.

(4) A makes a leaf transition from $c_1$ to $c_2$ if there is a (leaf) node n s.t. (a) $n \in C_1$,
(b) $C_1 = C_2$, (c) $c_2(n) = \delta_{leaf}(c_1(n), label(n))$, and (d) $c_2$ is identical to $c_1$ on
$C_1 - \{n\}$.

The start configuration $c : C \rightarrow Q$ has $C = \{root_t\}$ and $c(root_t) = s$. Any
configuration with $c(root_t) \in F$ is an accepting configuration. (That is, a $2DTA^r$
starts at the root and terminates there.) A run is a sequence of configurations
$c_1, \ldots, c_m$ such that $c_1 \rightarrow \cdots \rightarrow c_m$ and $c_1$ is the start configuration. A run is
accepting if $c_m$ is an accepting configuration and there does not exist a $c_{m+1}$ such
that $c_m \rightarrow c_{m+1}$.

Since often a number of transitions can be made in parallel, there are usually 
many different sequences of transitions that are possible. However, because of the 
disjointness of U and D , given a node n with some label and a (“current”) state q , 
at most one (kind of) transition involving n is possible at any point in time, and for 
all nodes, the sequence of states in which they are visited is the same in all these 
runs. Thus we can consider this type of automaton as deterministic and refer to the 
run of A rather than a run of A. Even though an automaton of the kind specified 
can run forever on an input tree, we can restrict ourselves to automata that always 
terminate. (This is a decidable property [Neven and Schwentick 2002].) 

The selection mechanism of A is defined as follows. A query automaton A selects
a node n in configuration $c : C \rightarrow Q$ if $n \in C$ and $\lambda(c(n), label(n)) = 1$. A selects n
if the run $c_1, \ldots, c_m$ is accepting and if there is an $1 \le i \le m$ such that n is selected
by A in $c_i$. □

Thus, a query automaton computes the set of nodes selected at any time during 
the run, not just in the terminating configuration (which, by our definition, only 
contains the root node in its cut). 

Example 4.9. Consider the following query on binary trees: Which nodes are
roots of subtrees that contain an even number of nodes labeled “a”? We evaluate
this query by first going down to the leaves of the tree and then, while ascending
towards the root, summing up the numbers of nodes labeled “a” in the subtrees
(modulo two).

We construct a ranked query automaton A as follows. The automaton A has
three states $s_\downarrow$, $s_0$, and $s_1$; $s_\downarrow$ is the start state and is used while going down,
and $s_0$ and $s_1$ represent the number of nodes labeled “a” below the current node
(modulo 2) while subsequently going up (these are also the final states). We need the
following transitions:

(1) first go down all the way to the leaves: $\delta_\downarrow(s_\downarrow, *, 2) = \langle s_\downarrow, s_\downarrow \rangle$

(2) a leaf node has no children: $\delta_{leaf}(s_\downarrow, *) = s_0$

(3) when ascending, we count all nodes labeled “a” below the current node (the label
of the current node is not accessible): $\delta_\uparrow((s_i, l_1), (s_j, l_2)) = s_x$ for $i, j \in \{0, 1\}$, where

$x = [i + j + \chi(l_1 = a) + \chi(l_2 = a)] \bmod 2$

with $\chi(true) = 1$ and $\chi(false) = 0$.

The selection function $\lambda$ is $\bot$ except for $\lambda(s_0, \neg a) = 1$ and $\lambda(s_1, a) = 1$.

Now consider the tree

        $n_0$
       /     \
    $n_1$   $n_2$

with $\mathrm{label}_a = dom$. The run of A is

$c_0 \xrightarrow{\delta_\downarrow : n_0} c_1 \xrightarrow{\delta_{leaf} : n_1} c_2 \xrightarrow{\delta_{leaf} : n_2} c_3 \xrightarrow{\delta_\uparrow : n_1, n_2} c_4$

with the cuts $C_i$ and configurations $c_i$

$C_0 = \{n_0\}$,        $c_0 : n_0 \mapsto s_\downarrow$

$C_1 = \{n_1, n_2\}$,   $c_1 : n_1 \mapsto s_\downarrow,\ n_2 \mapsto s_\downarrow$

$C_2 = \{n_1, n_2\}$,   $c_2 : n_1 \mapsto s_0,\ n_2 \mapsto s_\downarrow$

$C_3 = \{n_1, n_2\}$,   $c_3 : n_1 \mapsto s_0,\ n_2 \mapsto s_0$

$C_4 = \{n_0\}$,        $c_4 : n_0 \mapsto s_0$

The result of our query on the given tree is empty, as we have an odd number of 
nodes labeled “a” in all subtrees. □ 
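
The run in this example can be replayed mechanically. The following Python sketch
simulates a ranked query automaton by rewriting configurations (dictionaries from the
nodes of the current cut to states) until no transition applies, and then instantiates
it with the parity automaton described above. The dictionary layout of the automaton,
the state name `"down"` for the start state, and all function names are assumptions made
for illustration; the sketch also assumes that the automaton terminates on its input.

```python
def step(config, children, label, root, aut):
    """Return a successor configuration, or None if no transition applies."""
    for n, q in config.items():
        a, kids = label[n], children.get(n, [])
        if aut["in_U"](q, a):
            # Root transition: only possible when the cut is exactly {root}.
            if n == root and len(config) == 1:
                q2 = aut["root"](q, a)
                if q2 is not None:
                    return {n: q2}
        elif kids:
            # Down transition: replace n by its children in the cut.
            states = aut["down"](q, a, len(kids))
            if states is not None:
                new = {m: s for m, s in config.items() if m != n}
                new.update(zip(kids, states))
                return new
        else:
            # Leaf transition: the cut stays the same, only n changes state.
            q2 = aut["leaf"](q, a)
            if q2 is not None:
                return {**config, n: q2}
    for n, kids in children.items():
        # Up transition: all children of n are in the cut with pairs in U.
        if kids and all(k in config and aut["in_U"](config[k], label[k]) for k in kids):
            q2 = aut["up"](tuple((config[k], label[k]) for k in kids))
            if q2 is not None:
                new = {m: s for m, s in config.items() if m not in kids}
                new[n] = q2
                return new
    return None


def run(children, label, root, aut):
    """Return (accepted, selected nodes) for the run on the given tree."""
    config, selected = {root: aut["start"]}, set()
    while config is not None:
        selected |= {n for n, q in config.items() if aut["select"](q, label[n])}
        last, config = config, step(config, children, label, root, aut)
    accepted = last.get(root) in aut["final"]
    return accepted, (selected if accepted else set())


def parity_up(pairs):
    """Up transition of the parity automaton (binary trees only)."""
    if len(pairs) != 2:
        return None
    (q1, l1), (q2, l2) = pairs
    x = (int(q1[1]) + int(q2[1]) + (l1 == "a") + (l2 == "a")) % 2
    return f"s{x}"


PARITY = {
    "start": "down", "final": {"s0", "s1"},
    "in_U": lambda q, a: q in ("s0", "s1"),
    "down": lambda q, a, m: ("down", "down") if m == 2 else None,
    "leaf": lambda q, a: "s0",
    "up": parity_up,
    "root": lambda q, a: None,
    "select": lambda q, a: (q == "s0" and a != "a") or (q == "s1" and a == "a"),
}

if __name__ == "__main__":
    children = {"n0": ["n1", "n2"]}
    print(run(children, {"n0": "a", "n1": "a", "n2": "a"}, "n0", PARITY))
```

On the tree of the example the run is accepting and selects no node. Relabeling $n_1$
with a label other than “a” makes the automaton select $n_0$ and $n_1$, whose subtrees then
contain two and zero nodes labeled “a”, respectively.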

Given a ranked query automaton, let an index i s.t. $n \in C_i$ be called a crossing
index on n. Let there be states $q_0, q \in Q$, nodes $n_0, n$ such that $n_0$ is the parent
of n, and indexes $i < j$ such that $c_i(n_0) = q_0$, $(q_0, label(n_0)) \in D$, interval $[i+1, j]$
does not contain a crossing index on $n_0$, $c_j(n) = q$, and $(q, label(n)) \in U$. Then,
$(q_0, q, n)$ is called an imminent return situation. Informally, we have an imminent
return situation $(q_0, q, n)$ in a run if we are about to return from node n (where we
are currently in state q, thus $(q, label(n)) \in U$) to its parent $n_0$ and the last time
$n_0$ was part of a configuration, it was assigned state $q_0$ (so it must have been the
case that $(q_0, label(n_0)) \in D$). Then, q is uniquely determined by node n and state
$q_0$, the most recent state assignment of the parent node of n in the run:

Lemma 4.10. Given state $q_0$ and node n, there is at most one state q s.t.
$(q_0, q, n)$ is an imminent return situation.

Proof. The fact that q functionally depends on $q_0$ and n in imminent return
situations is a direct consequence of determinism as required in Definition 4.8.

We show this by a simple induction (bottom-up on the tree, with a nested
induction on transitions occurring localized at a node which we discuss informally).
Let $n_0$ be any node and take any i such that $c_i(n_0) = q_0$ and $(q_0, label(n_0)) \in D$.

(Induction start.) Let all children of $n_0$ be leaves. Consider an arbitrary child
n of $n_0$. Initially, we make a down transition from $n_0$ to n (and its siblings), and
assign state $c_{i+1}(n)$ to n. Since the automaton is deterministic, $c_{i+1}$ is functionally
determined by $q_0$ (and the tree). In case $(c_{i+1}(n), label(n)) \in U$, $(q_0, c_{i+1}(n), n)$
is the imminent return situation in question and the induction hypothesis (the
lemma) holds. Otherwise, only leaf transitions are possible. For a leaf transition on
n from configuration $c_k$ to $c_{k+1}$, the outcome is again uniquely determined by $q_0$
and k. This is true because the automaton is deterministic and the transition only
depends on the single state $c_k(n)$. If now $(c_{k+1}(n), label(n)) \in U$, we next return to
$n_0$, so $c_{k+1}(n)$ is the unique state such that $(q_0, c_{k+1}(n), n)$ is an imminent return
situation, and the induction hypothesis is again true.

(Induction step.) Let $n_0$ have at least one child that is not a leaf. We make a
down transition to configuration $c_{i+1}$. Consider an arbitrary child n of $n_0$. Again,
initially, $c_{i+1}(n)$ uniquely depends on $q_0$ (and the tree). We have discussed the case
of leaf nodes above, so assume that n is not a leaf. At any step k of the run before
the return to $n_0$ such that $n \in C_k$, only a down transition is possible. This again
assigns a state to each child of n only depending on $c_k(n)$ (and thus on $q_0$ and k).
Let $l > k$ be the crossing index on n subsequent to k, the time at which we return
to n from the excursion down its subtree. The transition to configuration $c_l$ is an
up transition, and as by our induction hypothesis the imminent return situation
$(c_k(n), \cdot, n')$ for each child n' of n only depends on $c_k(n)$ and n' (and $c_k(n)$
in turn only depends on $q_0$ and k), there is again only one possible up transition to
n to be made. As discussed above, if $(c_l(n), label(n)) \in U$, we are done, otherwise
we continue with a down transition.

We have not made any assumptions about i. Thus, given $q_0$ and n (with parent
$n_0$), for all i, j such that $(c_i(n_0), c_j(n), n)$ is an imminent return situation and
$c_i(n_0) = q_0$, $c_j(n)$ is the same. □

Now we can state our result for ranked queries. Observe that in a ranked query
automaton $A = (Q, \Sigma, F, s, \delta_\uparrow, \delta_\downarrow, \delta_{root}, \delta_{leaf}, \lambda)$, the sets Q, $\Sigma$, and F as well as
the graphs of the functions $\delta_\uparrow$, $\delta_\downarrow$, $\delta_{root}$, $\delta_{leaf}$, and $\lambda$ are finite. As it is easy to
verify, the following LOGSPACE transformation does not depend on the details
of the representation of A. (Notably, we do not assume an artificially inflated
representation, such as states, labels, or ranks encoded in unary.)

Theorem 4.11. Given a ranked query automaton, an equivalent monadic datalog
query can be computed in logarithmic space.

Proof. We first provide an intuition and overview of the ideas used in this proof. 
After that, the simulation will be described in detail. 

(1) Let A be a ranked query automaton. The monadic datalog program $\mathcal{P}$ to be
defined below aims at computing exactly all the state assignments made during
the run of A, in no particular order, formalized as the “history” of A,

$H = \{\langle q, n \rangle \mid n \in C_i \text{ and } c_i(n) = q \text{ for some } i\}$.

(2) We do not try to model configurations. Instead, we define $\mathcal{P}$ in such a way
that it monotonically computes state assignment atoms occurring at any time
during the run of A. As a first attempt, these can be assumed to be of the
form q(n), where q is a state from A and n is a node of the input tree.

The encoding mirrors the four kinds of transitions of Definition 4.8 so closely
that it is easy to see that it is complete. That is, all state assignments made
during the run of A are certain to be in the fixpoint of our program.

(3) Rules for down, root and leaf transitions in $\mathcal{P}$ cannot cause a violation of
soundness by themselves. They each only need a single state assignment as
precondition in their body to “fire” and thus cannot infer state assignment
atoms that do not eventually become true during the run of A. To extend
soundness to up transitions, we alter our encoding to use predicate names that
are pairs of state names. An atom $\langle q_0, q \rangle(n)$ intuitively means that at some
point i during the run of A, $c_i(n) = q$ and the parent of node n was assigned
state $q_0$ the last time (before i) that it was part of a configuration. We will
show that this tweak ensures the desired soundness for up transitions as well.

Now we describe the simulation in detail. As mentioned, predicate names are
pairs (of state names) in $(Q \cup \{\nabla\}) \times Q$. The symbol $\nabla$ denotes a dummy state
which we will assign to the imaginary parent of the root node whenever a state
assignment to the root node has to be made.

The encoding $\mathcal{P}$ of A is the following set of rules. For all $q, q', q_1, \ldots, q_m \in Q$
and for all $a, a_1, \ldots, a_m \in \Sigma$,

(1) (Start state) we add the single rule

$\langle \nabla, s \rangle(x) \leftarrow \mathrm{root}(x).$

where s is the start state of A;

(2) (Up transition) if $\delta_\uparrow((q_1, a_1), \ldots, (q_m, a_m)) = q'$, we add the rules

$\langle q_0, q' \rangle(x) \leftarrow \langle q_0, q \rangle(x),$
    $\mathrm{child}_1(x, x_1), \ldots, \mathrm{child}_m(x, x_m),$
    $\langle q, q_1 \rangle(x_1), \ldots, \langle q, q_m \rangle(x_m),$
    $\mathrm{label}_{a_1}(x_1), \ldots, \mathrm{label}_{a_m}(x_m).$

for all $q_0 \in (Q \cup \{\nabla\})$;

(3) (Down transition) if $\delta_\downarrow(q, a, m) = q_1 \cdots q_m$, we add the rules

$\langle q, q_i \rangle(x_i) \leftarrow \langle q_0, q \rangle(x), \mathrm{child}_i(x, x_i), \mathrm{label}_a(x).$

for all $1 \le i \le m$, $q_0 \in (Q \cup \{\nabla\})$;

(4) (Root transition) if $\delta_{root}(q, a) = q'$, we add the rule

$\langle \nabla, q' \rangle(x) \leftarrow \langle \nabla, q \rangle(x), \mathrm{label}_a(x), \mathrm{root}(x).$

(5) (Leaf transition) if $\delta_{leaf}(q, a) = q'$, we add the rules

$\langle q_0, q' \rangle(x) \leftarrow \langle q_0, q \rangle(x), \mathrm{label}_a(x), \mathrm{leaf}(x).$

for all $q_0 \in (Q \cup \{\nabla\})$;

(6) (Acceptance) if $q \in F$, we add the rules

$\mathrm{accept}(x) \leftarrow \mathrm{root}(x), \langle q_0, q \rangle(x).$

for all $q_0 \in (Q \cup \{\nabla\})$;

(7) (Selection function) finally, for each $q \in Q$ and $a \in \Sigma$ with $\lambda(q, a) = 1$, we add

$\mathrm{query}(x) \leftarrow \langle q_0, q \rangle(x), \mathrm{label}_a(x), \mathrm{accept}(y).$

for each $q_0 \in (Q \cup \{\nabla\})$.

Of course, all datalog variables $x, x_i, y$ in our encoding range over nodes in $dom_t$.
Given a query automaton, the equivalent monadic datalog program $\mathcal{P}$ as
discussed above can be computed in logarithmic space without difficulty.
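
As an illustration of how mechanical the encoding is, the following Python sketch emits
the seven groups of rules as strings from finite tables describing the automaton. The
table layout, the function name `encode_query_automaton`, and the textual rule syntax
(`<q0,q>` for a pair predicate, `nabla` for the dummy state) are assumptions made for this
example. The sketch materializes the rule list in memory and is therefore not itself a
logarithmic-space procedure; it only shows which rules are produced.

```python
def encode_query_automaton(states, start, final, delta_up, delta_down,
                           delta_root, delta_leaf, selecting):
    """Return the datalog rules of the encoding as a list of strings.

    delta_up:   dict {((q1, a1), ..., (qm, am)): q'}
    delta_down: dict {(q, a, m): (q1, ..., qm)}
    delta_root, delta_leaf: dict {(q, a): q'}
    selecting:  set of pairs (q, a) on which the selection function is 1
    """
    nabla = "nabla"                       # dummy state of the root's parent
    parents = [nabla] + list(states)      # Q united with the dummy state

    def pred(q0, q):
        return f"<{q0},{q}>"

    # (1) start state
    rules = [f"{pred(nabla, start)}(x) <- root(x)."]
    # (2) up transitions
    for pairs, q2 in delta_up.items():
        for q in states:
            body = []
            for i, (qi, ai) in enumerate(pairs, 1):
                body += [f"child{i}(x,x{i})", f"{pred(q, qi)}(x{i})",
                         f"label_{ai}(x{i})"]
            for q0 in parents:
                rules.append(f"{pred(q0, q2)}(x) <- {pred(q0, q)}(x), "
                             + ", ".join(body) + ".")
    # (3) down transitions
    for (q, a, m), targets in delta_down.items():
        for i, qi in enumerate(targets, 1):
            for q0 in parents:
                rules.append(f"{pred(q, qi)}(x{i}) <- {pred(q0, q)}(x), "
                             f"child{i}(x,x{i}), label_{a}(x).")
    # (4) root transitions
    for (q, a), q2 in delta_root.items():
        rules.append(f"{pred(nabla, q2)}(x) <- {pred(nabla, q)}(x), "
                     f"label_{a}(x), root(x).")
    # (5) leaf transitions
    for (q, a), q2 in delta_leaf.items():
        for q0 in parents:
            rules.append(f"{pred(q0, q2)}(x) <- {pred(q0, q)}(x), "
                         f"label_{a}(x), leaf(x).")
    # (6) acceptance
    for q in final:
        for q0 in parents:
            rules.append(f"accept(x) <- root(x), {pred(q0, q)}(x).")
    # (7) selection function
    for (q, a) in selecting:
        for q0 in parents:
            rules.append(f"query(x) <- {pred(q0, q)}(x), label_{a}(x), accept(y).")
    return rules


if __name__ == "__main__":
    # Parity automaton over the one-letter alphabet {a}, for brevity.
    rules = encode_query_automaton(
        states=["sd", "s0", "s1"], start="sd", final=["s0", "s1"],
        delta_up={((f"s{i}", "a"), (f"s{j}", "a")): f"s{(i + j) % 2}"
                  for i in (0, 1) for j in (0, 1)},
        delta_down={("sd", "a", 2): ("sd", "sd")},
        delta_root={}, delta_leaf={("sd", "a"): "s0"},
        selecting={("s1", "a")},
    )
    print(len(rules))
    print("\n".join(rules[:3]))
```

The number of emitted rules is polynomial in the size of the transition tables: every
table entry contributes at most $|Q| \cdot (|Q| + 1)$ rules per child position.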

It remains to be shown that our reduction is also correct. For a set $X \subseteq T_{\mathcal{P}}^{\omega}$, let

$\pi(X) := \{\langle q, n \rangle \mid (\exists q_0)\ \langle q_0, q \rangle(n) \in X\}$.

We claim that $\pi(T_{\mathcal{P}}^{\omega}) = H$, and show this next.

Regarding the completeness of $T_{\mathcal{P}}^{\omega}$, it is easy to see that the state assignments
in our fixpoint $T_{\mathcal{P}}^{\omega}$ are certain to subsume those in H, i.e., $\pi(T_{\mathcal{P}}^{\omega}) \supseteq H$. Consider
the definitions of transitions and runs in Definition 4.8. The rules of $\mathcal{P}$ closely
mirror an operational (rule-based) version of these definitions with (superficially)
weakened preconditions. For example, the definition of down transitions says that
if $c_i(n) = q$, a down transition $\delta_\downarrow(q, a, m) = q_1 \cdots q_m$ can be executed, resulting
in $c_{i+1}(n_j) = q_j$ for all $1 \le j \le m$. (Moreover, the definition states that $c_i$ is
undefined on $n_1, \ldots, n_m$ and $c_{i+1}$ is undefined on n, which is not relevant to our
completeness claim.) Rather than requiring that node n must be in state q in the
immediately preceding configuration, the down transition rules of $\mathcal{P}$ only require
that n must have been assigned q in some earlier configuration, plus a condition on
an earlier state of the parent of n that by our definition of $\mathcal{P}$ always holds when
the down transition precondition of Definition 4.8 holds. An analogous observation
can be made for the remaining kinds of transitions.

The other direction (i.e., soundness of $T_{\mathcal{P}}^{\omega}$) can be shown by induction over the
computation of $T_{\mathcal{P}}^{\omega}$.

- Initially, we obtain the fact set $\{\langle \nabla, s \rangle(root_t)\}$ by applying the start state rule.
(Clearly, $\pi(\{\langle \nabla, s \rangle(root_t)\}) \subseteq H$.)

- Let X (with $\pi(X) \subseteq H$) be the set of facts obtained so far in the fixpoint
computation. Rules in $\mathcal{P}$ which correspond to root, leaf, and down transitions
have only a single state assignment premise in their bodies. If the premise is true
with respect to X (and thus H), the state assignment $\langle q_0, q \rangle(n)$ inferred by such
a rule must again be in some configuration of the run of A and thus be sound
(that is, $\langle q, n \rangle \in H$).

- It is easy to verify by inspection of our program $\mathcal{P}$ that if atom $\langle q, q_k \rangle(n_k)$
evaluates to true and $q \ne \nabla$, then q is the state that was assigned to the parent of $n_k$
the most recent time it was visited. If $(q_k, label(n_k)) \in U$, then $(q, q_k, n_k)$ is an
imminent return situation.

Let X (with $\pi(X) \subseteq H$) be the set of facts obtained so far in the fixpoint
computation, and let an up transition rule of $\mathcal{P}$ infer $\langle q_0, q' \rangle(n)$ from

$\langle q_0, q \rangle(n), \langle q, q_1 \rangle(n_1), \ldots, \langle q, q_m \rangle(n_m) \in X$

(where $(q_1, label(n_1)), \ldots, (q_m, label(n_m)) \in U$ and the nodes $n_1, \ldots, n_m$ are the
children of node n). Clearly, $(q, q_1, n_1), \ldots, (q, q_m, n_m)$ are imminent return
situations. By Lemma 4.10, the $q_1, \ldots, q_m$ only depend on the state q as the most
recent state assignment to the parent of $n_k$ and on the tree. By the induction
hypothesis $\pi(X) \subseteq H$, at some point i during the run immediately preceding a down
transition from node n, $c_i(n) = q$. The subsequent computations in the subtree
of n are captured by the imminent return situations $(q, q_1, n_1), \ldots, (q, q_m, n_m)$.
Since all these can be found in X, they follow on i in the automaton run as well.
It follows that $\pi(X \cup \{\langle q_0, q' \rangle(n)\}) \subseteq H$.

Thus, our claim that $\pi(T_{\mathcal{P}}^{\omega}) = H$ is true.

The definition of the selection function for a query automaton nicely coincides
with the monotone semantics of monadic datalog. In part (7) of our monadic
datalog encoding, we have defined the query predicate query. Clearly, on a tree t,

$\{n \mid A \text{ accepts } t,\ \langle q, n \rangle \in H, \text{ and } \lambda(q, label(n)) = 1\} = \{n \mid \mathrm{query}(n) \in T_{\mathcal{P}}^{\omega}\}$.

Thus, the query defined by $\mathcal{P}$ is indeed equivalent to the query defined by A. □

Next we consider the corresponding problem over unranked trees. Analogously 
to query automata for ranked trees, we define the class of strong query automata 
over unranked trees. Let two-way deterministic finite (string) automata (2DFA) be 
defined in the normal way (e.g., [Hopcroft and Ullman 1979]). 

Definition 4.12 [Neven and Schwentick 2002]. A strong unranked query 
automaton (SQA$^u$) is a tuple 

$A = (Q, \Sigma, F, s, \delta_{\uparrow}, \delta_{\downarrow}, \delta_{-}, \delta_{\mathit{root}}, \delta_{\mathit{leaf}}, \lambda),$ 

where $Q$, $F$, $s$, $U$, $D$, $\delta_{\mathit{leaf}}$, $\delta_{\mathit{root}}$ and $\lambda$ are as in Definition 4.8. Let $U_{\mathit{up}}$ and 
$U_{\mathit{stay}}$ be two disjoint regular subsets of $U^*$. The transition function for up 
transitions is now of the form $\delta_{\uparrow} : U_{\mathit{up}} \to Q$, and the transition function for down 
transitions is of the form $\delta_{\downarrow} : D \times \mathbb{N} \to Q^*$ (where $\mathbb{N}$ is the set of natural 
numbers). For each $(q,a) \in D$, $L_{\downarrow}(q,a) := \{\delta_{\downarrow}(q,a,i) \mid i \in \mathbb{N}\}$ is regular; for each 
$j \in \mathbb{N}$, $\delta_{\downarrow}(q,a,j)$ must be a string of length $j$; and for each $q \in Q$, the 
language $L_{\uparrow}(q) := \{w \in U^* \mid \delta_{\uparrow}(w) = q\}$ must be regular. The transition 
function $\delta_{-} : U_{\mathit{stay}} \to Q^*$ is for so-called stay transitions. We require this function 
to be computed by a 2DFA $B = (S, \Sigma_B = Q \times \Sigma, s_0, \delta_B, F_B, L, R)$ over the string 
$(c_1(n_1), \mathit{label}(n_1)), \dots, (c_1(n_m), \mathit{label}(n_m))$ with a selection function $\lambda_B : S \times \Sigma_B \to Q \cup \{\bot\}$ 
that - anytime during its run - maps nodes to states such that, upon the 
termination of $B$, each node has been assigned exactly one state in $Q$. $A$ makes a 
stay transition at a node $n$ (whose children are $n_1, \dots, n_m$) from a configuration 
$c_1 : C_1 \to Q$ to $c_2 : C_2 \to Q$ if 

(a) $n_1, \dots, n_m \in C_1$, 

(b) $C_2 = C_1$, 

(c) $\delta_{-}((c_1(n_1), \mathit{label}(n_1)), \dots, (c_1(n_m), \mathit{label}(n_m))) = c_2(n_1) \cdots c_2(n_m)$, and 

(d) $c_2$ is identical to $c_1$ on $C_1 - \{n_1, \dots, n_m\}$. 

We require that at each node, at most one stay transition is made (this is a 
decidable property [Neven and Schwentick 2002] for a given SQA$^u$). 

The definitions of configurations, leaf, root, up and down transitions, run, and 
accepting run carry over from Definition 4.8. The query computed by $A$ and the 
tree language defined by $A$ are defined analogously to Definition 4.8. □ 

Definition 4.12 leaves it open in which form the regular languages $L_{\downarrow}(q,a)$ are 
provided. It is clear that each regular language $L_{\downarrow}(q,a)$ must be of density 1. 
(A regular language $L \subseteq \Sigma^*$ is said to be of constant density $d$ iff for each $i$, 
$|L \cap \Sigma^i| \le d$.) As a special case of an interesting result for regular languages of 
polynomial density [Szilard et al. 1992; Yu 1997], we have that 

Proposition 4.13. Each regular language of constant density over alphabet $\Sigma$ 
can be denoted by a finite union of regular expressions of the form $uv^*w$ (where the 
$u, v, w$ are words over $\Sigma$). 

Conversely, it is clear that every regular language defined by such a regular 
expression has constant density. 

In the following, we will make the assumption that all languages $L_{\downarrow}(q,a)$ are 
provided in this normal form.$^9$ Definition 4.12 also does not specify the form in 
which languages $L_{\uparrow}(q)$ are provided. Without loss of generality, we assume that each 
such language is represented by an NFA. 

9 Note that Definition 4.12 precisely recaptures the definition of SQA$^u$ in the original reference 
[Neven and Schwentick 2002]. However, throughout the proofs of that paper, languages are always 
assumed to be in the normal form of Proposition 4.13, so we make the same assumption. 

Theorem 4.14. Given an SQA$^u$, an equivalent monadic datalog query can be 
computed in logarithmic space. 

Proof. The proof works analogously to the one for ranked query automata 
(Theorem 4.11), with the following changes to the encoding of the automaton 
(which now is an SQA$^u$) in monadic datalog. 

(1) Down transitions: 

Let $L_{\downarrow}(q,a) \subseteq Q^*$ be provided as a regular expression $\bigcup_i u_i v_i^* w_i$, where the 
$u_i$, $v_i$, and $w_i$ are words over an alphabet consisting of the states of the query 
automaton. 

Intuitively, we need to define a monadic datalog program that checks, at a node 
$n$ with children $n_1 \dots n_m$, whether at least one expression $u_i v_i^* w_i$ has a word 
of length $m$. This is done in the steps (a) to (e). If such a matching $u_i v_i^* w_i$ 
is possible, the nodes $n_1 \dots n_m$ are assigned the new states according to the 
matched word of states $u_i v_i^* w_i$ in step (f). The encoding that follows is not 
completely trivial; therefore we provide an example below (Example 4.15). 

We proceed as follows, for each i. 

(a) First, we use temporary predicates to mark the $|u_i|$ leftmost child nodes of 
$n$ as space to be occupied by $u_i$ ($1 \le k < |u_i|$, $q_0 \in Q \cup \{\nabla\}$): 

$\mathrm{utmp}_{q,i,1}(x_1) \leftarrow \langle q_0, q\rangle(x),\ \mathrm{firstchild}(x, x_1),\ \mathrm{label}_a(x).$ 
$\mathrm{utmp}_{q,i,k+1}(x_{k+1}) \leftarrow \mathrm{utmp}_{q,i,k}(x_k),\ \mathrm{nextsibling}(x_k, x_{k+1}).$ 

(b) Next, we mark the $|w_i|$ rightmost children of $n$ as space to be occupied by 
$w_i$ ($1 < l \le |w_i|$, $q_0 \in Q \cup \{\nabla\}$): 

$\mathrm{wtmp}_{q,i,|w_i|}(x') \leftarrow \langle q_0, q\rangle(x),\ \mathrm{lastchild}(x, x').$ 
$\mathrm{wtmp}_{q,i,l-1}(x') \leftarrow \mathrm{wtmp}_{q,i,l}(x),\ \mathrm{nextsibling}(x', x).$ 

(c) All nodes before those marked $w_i$ are marked as such: 

$\mathrm{bwtmp}_{q,i}(x') \leftarrow \mathrm{wtmp}_{q,i,1}(x),\ \mathrm{nextsibling}(x', x).$ 
$\mathrm{bwtmp}_{q,i}(x') \leftarrow \mathrm{bwtmp}_{q,i}(x),\ \mathrm{nextsibling}(x', x).$ 

(d) Next we try to assign a multiple of $|v_i|$ markings to the $(|u_i| + 1)$-th node 
up to the rightmost node marked “before w”. For each $1 \le m < |v_i|$, 

$\mathrm{vtmp}_{q,i,1}(x') \leftarrow \mathrm{utmp}_{q,i,|u_i|}(x),\ \mathrm{nextsibling}(x, x'),\ \mathrm{bwtmp}_{q,i}(x').$ 
$\mathrm{vtmp}_{q,i,m+1}(x') \leftarrow \mathrm{vtmp}_{q,i,m}(x),\ \mathrm{nextsibling}(x, x'),\ \mathrm{bwtmp}_{q,i}(x').$ 
$\mathrm{vtmp}_{q,i,1}(x') \leftarrow \mathrm{vtmp}_{q,i,|v_i|}(x),\ \mathrm{nextsibling}(x, x'),\ \mathrm{bwtmp}_{q,i}(x').$ 

(e) If the number of $v_i$-markings assigned is indeed a multiple of $|v_i|$, mark the 
temporary facts computed so far (for each subexpression $i$) as “successful” 
($u_i v_i^* w_i$ contains a word of the right length). 

$\mathrm{succ}_{q,i}(x') \leftarrow \mathrm{utmp}_{q,i,|u_i|}(x'),\ \mathrm{nextsibling}(x', x),\ \mathrm{wtmp}_{q,i,1}(x).$ 
$\mathrm{succ}_{q,i}(x') \leftarrow \mathrm{vtmp}_{q,i,|v_i|}(x'),\ \mathrm{nextsibling}(x', x),\ \mathrm{wtmp}_{q,i,1}(x).$ 
$\mathrm{succ}_{q,i}(x') \leftarrow \mathrm{succ}_{q,i}(x),\ \mathrm{nextsibling}(x, x').$ 
$\mathrm{succ}_{q,i}(x') \leftarrow \mathrm{succ}_{q,i}(x),\ \mathrm{nextsibling}(x', x).$ 

(f) Finally, for each $\alpha \in \{u, v, w\}$ and each $1 \le j \le |\alpha_i|$ where $q'$ is the $j$-th 
symbol in $\alpha_i$, we create rules to compute new state assignments 

$\langle q, q'\rangle(x) \leftarrow \mathrm{succ}_{q,i}(x),\ \alpha\mathrm{tmp}_{q,i,j}(x).$ 

$L(\bigcup_i u_i v_i^* w_i)$ has density one because $L_{\downarrow}(q,a)$ has density one, thus there 
is at most one word of states $\langle q, q'\rangle$ that is “written” (in terms of atoms, 
inferred by the program). Clearly, this is true even if there is more than 
one $i$ such that $u_i v_i^* w_i$ matches that word. 

(2) Up transitions: Let $B = (S, s_0, \delta, F)$ be a nondeterministic finite automaton 
for $L_{\uparrow}(q_0)$ (that is, its alphabet is $U$). For each $q_1 \in (Q \cup \{\nabla\})$, $q_2 \in Q$, we 
create rules as follows. 

(a) For each $s' \in \delta(s_0, (q,a))$, 

$\mathrm{tmp}_{q_2,s'}(x) \leftarrow \mathrm{firstchild}(x_0, x),\ \langle q_2, q\rangle(x),\ \mathrm{label}_a(x).$ 

(b) For each $s' \in \delta(s, (q,a))$, 

$\mathrm{tmp}_{q_2,s'}(x') \leftarrow \mathrm{tmp}_{q_2,s}(x),\ \mathrm{nextsibling}(x, x'),\ \langle q_2, q\rangle(x'),\ \mathrm{label}_a(x').$ 

(c) For each $s \in F$, 

$\mathrm{bck}_{q_2}(x) \leftarrow \mathrm{tmp}_{q_2,s}(x),\ \mathrm{lastsibling}(x).$ 
$\mathrm{bck}_{q_2}(x_0) \leftarrow \mathrm{nextsibling}(x_0, x),\ \mathrm{bck}_{q_2}(x).$ 

$\langle q_1, q_0\rangle(x_0) \leftarrow \langle q_1, q_2\rangle(x_0),\ \mathrm{firstchild}(x_0, x),\ \mathrm{bck}_{q_2}(x).$ 

That is, we traverse a set of siblings from left to right to check whether the 
state-and-label pairs of the sibling nodes constitute a word of language $L_{\uparrow}(q_0)$. 
When we reach a final state of $B$ on the last sibling, we go back to the first 
sibling and from there to the parent. Then we assign the new state and thus 
make our up transition. 

(3) Stay transitions: The encoding of a 2DFA with a selection function $\lambda$ is 
straightforward. Each transition only depends on a single state assignment. As 
discussed for the case of query automata for ranked trees earlier, this condition 
entails that the computation of the union of all the configurations run through 
by the 2DFA as a fixpoint of our monadic datalog program and the application 
of a selection function $\lambda$ to this set is sound. Since by Definition 4.12 each tree 
node may only be involved in a stay transition once, there are no difficulties 
in managing temporary predicates to assure the soundness of the simulation of 
the 2DFA. 

An analogous result to Lemma 4.10 can be stated for unranked trees as well. 
The correctness proof of the altered simulation works analogously to the proof of 
Theorem 4.11. Again the reduction can be computed in LOGSPACE. □ 


[Figure 2 (diagram not reproduced): for the four children $n_1, \dots, n_4$ and the two subexpressions 
$u_1 v_1^* w_1$ and $u_2 v_2^* w_2$, panels (a)-(f) show the temporary facts (wtmp, bwtmp, vtmp, succ) 
derived in each stage; panel (f) shows the resulting state assignments 
$\langle q, q_1\rangle$, $\langle q, q_0\rangle$, $\langle q, q_1\rangle$, $\langle q, q_0\rangle$.] 


Fig. 2. Stages in the down transition computation of Example 4.15. 

We conclude this section with a clarifying example of the construction for down 
transitions in the previous proof. 

Example 4.15. Consider a node $n_0$ labeled “a” which is in state $q$ in the current 
configuration $c_j$. Let $L_{\downarrow}(q, a) = (q_1 q_0)^* \cup (q_1 q_0)^* q_1$. We first decompose the regular 
expression into the two subexpressions $u_1 v_1^* w_1$ and $u_2 v_2^* w_2$ with $u_1 = w_1 = u_2 = \epsilon$, 
$v_1 = v_2 = (q_1 q_0)$, and $w_2 = q_1$. Assume that the current node $n_0$ to which we 
apply the down transition has four children. The fixpoint computation of the 
monadic datalog encoding for down transitions proceeds in the stages (a)-(f) shown 
in Figure 2. In stage (d), the word $v_2$ can only be assigned once fully and in part 
for the second time (as $n_4$ is blocked by the word $w_2$). Thus $\mathrm{succ}_{q,2}$ cannot be 
inferred in stage (e). The first subexpression, however, can be used to generate a 
four-symbol word $q_1 q_0 q_1 q_0$, and consequently to make a down transition. □ 
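
The length check that the down-transition rules carry out can be mirrored by a few lines of ordinary code. The following Python sketch is only an illustration and not part of the datalog encoding: the function name `match_down_transition` and the representation of each subexpression as a triple of tuples `(u, v, w)` are assumptions made here. It looks for the unique word of length m denoted by some $u_i v_i^* w_i$, relying on the density-1 requirement.

```python
def match_down_transition(expressions, m):
    """Return the state word of length m denoted by some u v* w, or None.

    expressions: list of (u, v, w) triples, each a tuple of state names
                 (hypothetical representation of the union of u_i v_i* w_i).
    m:           number of children of the current node.
    """
    words = set()
    for u, v, w in expressions:
        rest = m - len(u) - len(w)          # positions left for copies of v
        if rest < 0:
            continue                        # u and w alone are already too long
        if rest == 0:
            words.add(tuple(u) + tuple(w))  # zero copies of v
        elif len(v) > 0 and rest % len(v) == 0:
            words.add(tuple(u) + tuple(v) * (rest // len(v)) + tuple(w))
    if len(words) > 1:
        # A language of density 1 has at most one word per length.
        raise ValueError("language is not of density 1")
    return next(iter(words), None)


# The two subexpressions of Example 4.15: (q1 q0)* and (q1 q0)* q1.
EXAMPLE = [((), ("q1", "q0"), ()), ((), ("q1", "q0"), ("q1",))]
```

For four children, only the first subexpression fits, as in the example; for three children, only the second does.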

The reductions presented in this section also constitute alternative proofs of the 
expressiveness results of the previous section, as the two forms of query automata 
presented capture unary MSO queries over trees. 

Proposition 4.16 [Neven and Schwentick 2002]. A unary query over ranked 
trees is MSO-definable iff there is a ranked query automaton which computes 
it. A unary query over unranked trees is MSO-definable iff there is an SQA$^u$ that 
computes it. 

Corollary 4.17. For each unary MSO query over ranked (unranked) trees, 
there exists a monadic datalog program over $\tau_{rk}$ ($\tau_{ur}$) that defines the same query. 

Proof. By Proposition 3.3, all monadic datalog queries can be expressed in 
MSO. The other direction immediately follows from our reductions of Theorems 4.11 
and 4.14. □ 

Moreover, monadic datalog (over trees) also inherits a hardness result for the 
query containment problem from query automata. By the query containment prob¬ 
lem for query automata , we refer to containment between the sets of nodes selected 
by two such automata rather than containment of the tree languages accepted. For 
its role in query minimization, containment between two distinguished predicates 
of two monadic datalog programs is the prototypical query optimization problem. 

Proposition 4.18 [Cosmadakis et al. 1988]. Containment of monadic datalog 
queries over arbitrary finite structures is EXPTIME-hard and in 2-EXPTIME. 

Proposition 4.19 [Neven and Schwentick 2002]. The query containment 
problem for ranked query automata as well as for SQA$^u$ is EXPTIME-complete. 

Proposition 4.19 and our reductions imply that the EXPTIME-hardness result 
of Proposition 4.18 already holds for trees: 

Corollary 4.20. The query containment problem for monadic datalog over $\tau_{rk}$ 
as well as over $\tau_{ur}$ is EXPTIME-hard. 

Proof. By Proposition 4.19, the query containment problem for query automata 
is EXPTIME-hard. Since EXPTIME is closed under LOGSPACE-reductions and 
Theorems 4.11 and 4.14 offer LOGSPACE-reductions from query automata to 
monadic datalog programs, the query containment problem of monadic datalog 
is EXPTIME-hard as well. □ 

The evaluation of a monadic datalog query has the strong points of guaranteed 
termination and even running time linear in the size of the program and linear in 
the size of the tree. This is in stark contrast to runs of query automata which, even 
if they terminate, may take superpolynomially many steps to do so. Our simulation 
of query automata in monadic datalog allows for an efficient means of evaluating 
query automata. We demonstrate this for ranked query automata, but the same 
case can be made for unranked query automata as well. 

Example 4.21. Given an integer $\alpha > 1$, let $\beta = 2^{\alpha}$ and let $A_{\beta}$ be a ranked 
($K = 2$) query automaton over alphabet $\Sigma = \{a\}$ with states 
$Q = \{q_{i,j} \mid 1 \le i \le \beta + 1,\ 1 \le j \le \beta + 1\}$, start state $q_{1,1}$, single final state $q_{1,\beta+1}$, 
$D = \{(q_{i,j}, a) \mid 1 \le i \le \beta + 1,\ 1 \le j \le \beta\}$, $U = \{(q_{i,\beta+1}, a) \mid 1 \le i \le \beta + 1\}$, and transition 
functions defined as $\delta_{\downarrow}(q_{i,j}, a, 2) = (q_{i,1}, q_{j,1})$, $\delta_{\uparrow}((q_{i,\beta+1}, a), (q_{j,\beta+1}, a)) = q_{i,j+1}$, 
and $\delta_{\mathit{leaf}}(q_{i,1}, a) = q_{i,\beta+1}$ for $1 \le i \le \beta + 1$, $1 \le j \le \beta$. Any selection function will 
do as we only care about the length of the run. 

Now consider runs of $A_{\beta}$ on complete binary trees in which all nodes are labeled 
$a$. Let $n = |\mathit{dom}_t|$, which is proportional to the size of the tree. In a run of $A_{\beta}$ 
on such a tree, from each non-leaf node $v$, once visited, we first make $\beta$ down 
transitions before we return to the parent of $v$ with an up transition. Each node 
at depth $d$ is thus visited $\Theta(\beta^d)$ times. Obviously, the depth of a complete binary 
tree is $\log_2(|\mathit{dom}_t| + 1) - 1$. Therefore, such a run takes work 
$\Theta(n \cdot \beta^{\log_2(n+1)-1}) = \Theta(n \cdot (\frac{n+1}{2})^{\log_2 \beta}) = \Theta((\frac{n+1}{2})^{\alpha+1})$. □ 

The encoding of any query automaton $A_{\beta}$ in monadic datalog, on the other hand, 
runs in time linear in the size of the tree and quadratic in the size of $A_{\beta}$ (which is 
proportional to $\beta^2 = 2^{2\alpha}$), i.e. in time $O(\beta^4 \cdot n) = O(2^{4\alpha} \cdot n)$. 

5. A NORMAL FORM FOR MONADIC DATALOG ON TREES 

As we show in this section, each monadic datalog program can be efficiently rewritten 
into an equivalent program using only very restricted syntax. This motivates a 
normal form for monadic datalog over trees. 

Definition 5.1. A monadic datalog program $\mathcal{P}$ over $\tau_{rk}$ ($\tau_{ur}$) is in Tree-Marking 
Normal Form (TMNF) if each rule of $\mathcal{P}$ is of one of the following three forms: 

(1) $p(x) \leftarrow p_0(x).$   (2) $p(x) \leftarrow p_0(x_0),\ B(x_0, x).$   (3) $p(x) \leftarrow p_0(x),\ p_1(x).$ 

where the unary predicates $p_0$ and $p_1$ are either intensional or of $\tau_{rk}$ ($\tau_{ur}$) and $B$ is 
either $R$ or $R^{-1}$, where $R$ is a binary predicate from $\tau_{rk}$ ($\tau_{ur}$). □ 
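
To make the three rule forms concrete, the following Python sketch evaluates a TMNF program by naive fixpoint iteration. It is only an illustration under stated assumptions: the function name `eval_tmnf`, the tuple encoding of rules, and the dictionary representation of the tree are hypothetical, and the naive loop does not achieve the linear-time bound discussed in this paper.

```python
def eval_tmnf(rules, unary, binary):
    """Naive least-fixpoint evaluation of TMNF rules (illustrative only).

    rules:  ("copy", p, p0)          encodes  p(x) <- p0(x).
            ("step", p, p0, R, inv)  encodes  p(x) <- p0(x0), B(x0, x).
                                     with B = R if inv is False, else R^-1
            ("and",  p, p0, p1)      encodes  p(x) <- p0(x), p1(x).
    unary:  dict from extensional unary predicate names to sets of nodes
    binary: dict from binary relation names to sets of (node, node) pairs
    """
    facts = {name: set(nodes) for name, nodes in unary.items()}

    def get(name):
        return facts.get(name, set())

    changed = True
    while changed:
        changed = False
        for rule in rules:
            kind, p = rule[0], rule[1]
            if kind == "copy":
                new = set(get(rule[2]))
            elif kind == "step":
                _, _, p0, rel, inverse = rule
                pairs = binary[rel]
                if inverse:
                    pairs = {(b, a) for a, b in pairs}
                new = {x for x0, x in pairs if x0 in get(p0)}
            else:  # "and"
                new = get(rule[2]) & get(rule[3])
            if not new <= get(p):
                facts.setdefault(p, set()).update(new)
                changed = True
    return facts


# A root (node 0) with three children 1, 2, 3.
TREE_UNARY = {"root": {0}}
TREE_BINARY = {"firstchild": {(0, 1)}, "nextsibling": {(1, 2), (2, 3)}}
# Children of the root, written with rules of forms (1) and (2) only.
CHILD_RULES = [
    ("copy", "q1", "root"),
    ("step", "q2", "q1", "firstchild", False),
    ("step", "q2", "q2", "nextsibling", False),
    ("copy", "root.child", "q2"),
]
```

Calling `eval_tmnf(CHILD_RULES, TREE_UNARY, TREE_BINARY)` marks nodes 1, 2, and 3 with the predicate `root.child`.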

For our main result of this section, the signature for unranked trees may extend 
$\tau_{ur}$ to include the natural child relation - likely to be the most common form of 
navigation in trees - and the “lastchild” relation; “lastchild$(x, y)$” is true iff $y$ is 
the rightmost child of $x$. 

Theorem 5.2. For each monadic datalog program $\mathcal{P}$ over $\tau_{ur} \cup \{\mathrm{child}, \mathrm{lastchild}\}$ 
(resp., $\tau_{rk}$), there is an equivalent program in TMNF over $\tau_{ur}$ (resp., $\tau_{rk}$) which 
can be computed in time $O(|\mathcal{P}|)$. 

In order to prove this, we need to introduce a number of auxiliary results. The 
main steps we take to transform an arbitrary program $\mathcal{P}$ into one in TMNF will be to 
(1) translate $\mathcal{P}$ into a program in which each rule is acyclic (in a very strong sense) 
but which extends the signature to caterpillar expressions (Lemmas 5.4, 5.5, and 
5.6), to (2) simplify the acyclic rules (Lemmas 5.7 and 5.8), and to (3) rewrite these 
short and simple rules into ones that do not use caterpillar expressions (Lemma 5.9). 

We have to put some emphasis on mapping programs to TMNF in linear time, 
as our result on the complexity of Elog$^-$ (Corollary 6.4) depends on it. Thus, we 
will start by introducing some graph-theoretical background. 

Given a directed graph (digraph) $G = (V, E)$, a depth-index map is a (total) 
function $d_G : V \to \mathbb{Z}$ such that $d_G(v) + 1 = d_G(w)$ for every edge $(v, w) \in E$. 

Proposition 5.3. Given a digraph $G$, a depth-index map $d_G$ exists iff all paths 
between (not necessarily distinct) nodes $v, w$ in $G$ have the same length. 

In particular, if $G$ contains a (directed) cycle, no depth-index map exists for 
$G$. We can decide whether a depth-index map exists for $G$, and at the same time 
compute a map $d_G$ if it does, in time $O(|V| + |E|)$ by a straightforward traversal of $G$ 
(assigning, say, $d_G(v) = 0$ for the first node $v$ visited in each connected component 
of $G$, visiting neighbors via out-going as well as incoming edges, and marking nodes 
as visited to ensure linear runtime). Even though depth-index maps on a graph are 
not unique, all depth-index maps are equally well suited for our purposes. 
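
The traversal just described can be sketched in Python as follows. This is a minimal illustration, assuming that the graph is given as a node list and an edge list and that the depth condition is required along every edge; the function name `depth_index_map` is hypothetical.

```python
from collections import defaultdict, deque


def depth_index_map(nodes, edges):
    """Return a dict d with d[v] + 1 == d[w] for every edge (v, w), or None.

    Each connected component is explored once, following edges forwards
    (+1) and backwards (-1), so the running time is O(|V| + |E|).
    """
    adjacent = defaultdict(list)
    for v, w in edges:
        adjacent[v].append((w, 1))    # out-going edge: depth increases by one
        adjacent[w].append((v, -1))   # incoming edge: depth decreases by one

    depth = {}
    for start in nodes:
        if start in depth:
            continue                  # component already visited
        depth[start] = 0              # first node of a new component
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w, delta in adjacent[v]:
                expected = depth[v] + delta
                if w not in depth:
                    depth[w] = expected
                    queue.append(w)
                elif depth[w] != expected:
                    return None       # conflicting depths: no map exists
    return depth
```

For a chain x → y → z the sketch returns depths 0, 1, 2, while a directed cycle or two paths of different lengths between the same nodes yield `None`.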

Given an undirected graph $G = (V, E)$, the set of connected components $\mathcal{C}$ of $G$ 
is defined in the normal way. Notably, $\bigcup \mathcal{C} = V$ and the members of $\mathcal{C}$ are pairwise 
disjoint. The connected components of a digraph $G$ are those of the shadow of 
$G$, i.e., of the undirected graph obtained from $G$ by ignoring the edge directions. 

A multigraph is a pair $(V, E)$ of disjoint sets together with a map $E \to V \cup [V]^2$ 
assigning to each edge either one or two vertices, its ends. (By $[V]^2$ we denote the 
two-element subsets of $V$.) The query graph of a monadic datalog rule $r$ over signature $\Gamma$ 
is the multigraph $G_r = (V, E)$, with $V = \mathit{Vars}(r)$, $E = \{e_{R,x,y} \mid R(x,y) \in \mathit{Body}(r)\}$ 
and $e_{R,x,y} \mapsto \{x, y\}$. So, for a rule $r$ with $\mathit{Body}(r) = \{R(x,y), R(y,x)\}$, the query 
graph has two undirected edges $e_{R,x,y}, e_{R,y,x}$ with the same ends, $\{x, y\}$.$^{10}$ A rule 
is called acyclic iff its query graph is acyclic (i.e., is an undirected forest). 

10 This rule would be considered cyclic because there are two different paths between x and y. 

Lemma 5.4. Every monadic datalog program $\mathcal{P}$ over $\tau_{rk}$ ($\tau_{ur}$) can be rewritten in 
time $O(|\mathcal{P}|)$ into an equivalent program over $\tau_{rk}$ ($\tau_{ur}$) in which each rule is acyclic. 

Proof. We only consider programs over $\tau_{rk}$. Since unranked trees represented 
using $\tau_{ur}$ can be viewed as binary trees, $\tau_{ur}$ can be treated as a special case of 
$\tau_{rk}$. For each rule $r \in \mathcal{P}$, we proceed as follows. Let $G = (\mathit{Vars}(r), E)$ with 
$E = \{(x,y) \mid (\exists k)\ \mathrm{child}_k(x, y) \in \mathit{Body}(r)\}$ be a digraph. If no depth-index map 
$d_G : \mathit{Vars}(r) \to \mathbb{Z}$ on $G$ exists, $r$ is unsatisfiable and no output is produced for $r$. 

Otherwise, we proceed as follows for each $1 \le k \le K$. Let $G_k = (\mathit{Vars}(r), E_k)$ 
be the digraph with $E_k = \{(x,y) \mid \mathrm{child}_k(x, y) \in \mathit{Body}(r)\}$ and let $\mathcal{C}_k$ be the set 
of connected components of $G_k$. For each connected component $C \in \mathcal{C}_k$ and for 
each depth-index $i$, replace all occurrences of the variables in the equivalence class 
$\{x \in C \mid d_G(x) = i\}$ in $r$ by any single variable of that equivalence class. If the 
query graph of the rule $r'$ thus obtained is cyclic, $r'$ (and $r$) is unsatisfiable. If $r'$ is 
acyclic, we add $r'$ to the output and proceed to the next rule of $\mathcal{P}$. 

The method described can be easily implemented to run in linear time. Moreover, 
the output is also equivalent to the input. Assume that no depth-index map $d_G$ can be 
computed for $r$. By Proposition 5.3, this means that $G$ contains a cycle or two paths 
of different length. Since the union of the relations $\mathrm{child}_k$ (for $1 \le k \le K$) is a tree, 
no satisfying variable assignment can exist for $r$ then, and $r$ is indeed unsatisfiable. 
Unsatisfiable rules can be removed from the program without changing its meaning. 

The rule $r'$ obtained from $r$ by merging variables is equivalent to $r$. Clearly, $r'$ is 
precisely the rule we would obtain by simplifying $r$ using the bidirectional functional 
dependencies of the $\mathrm{child}_k$ relations (cf. Proposition 4.1) with the classical Chase 
technique (cf. [Aho et al. 1979; Maier et al. 1979; Abiteboul et al. 1995]). 

Since $G$ does not contain a directed cycle, cycles of the query graph of $r'$ must 
contain two atoms $R_1(x_1, y)$, $R_2(x_2, y)$, where $R_1 \neq R_2$. However, these two atoms 
taken together are certainly unsatisfiable, as each node can only be a $k$-th child for 
at most one $k$. □ 

Lemma 5.5. Every monadic datalog program $\mathcal{P}$ over $\tau_{ur} \cup \{\mathrm{child}\}$ can be rewritten 
in time $O(|\mathcal{P}|)$ into an equivalent program over $\tau_{ur} \cup \{\mathrm{nextsibling}^*\}$ in which 
each rule is acyclic. 

Proof. For each rule $r \in \mathcal{P}$ we proceed as follows. 

(1) Let $G_{ns} = (V_{ns}, E_{ns})$ be the digraph having $V_{ns} = \mathit{Vars}(r)$ and 
$E_{ns} = \{(x,y) \mid \mathrm{nextsibling}(x, y) \in \mathit{Body}(r)\}$, $\mathcal{C}$ the set of connected components of $G_{ns}$, 
and $G_{ch} = (\mathcal{C}, E_{ch})$ the digraph of the child relations coarsened to $\mathcal{C}$ ($(C_1, C_2) \in E_{ch}$ 
iff $C_1, C_2 \in \mathcal{C}$ and there are variables $x_1 \in C_1$ and $x_2 \in C_2$ such that an 
atom $\mathrm{firstchild}(x_1, x_2)$ or $\mathrm{child}(x_1, x_2)$ occurs in $\mathit{Body}(r)$). If no depth-index map 
$d : \mathcal{C} \to \mathbb{Z}$ on graph $G_{ch}$ exists, then $r$ is unsatisfiable and we are done for $r$. 

(2) Digraph $G_{ch}$ is now acyclic, and we traverse it bottom-up, unifying variables 
$x_1, x_2$ that are parents of variables in the same connected component of $\mathcal{C}$. Let 
$d_{min} = \min\{d(C) \mid C \in \mathcal{C}\}$ be the smallest and $d_{max} = \max\{d(C) \mid C \in \mathcal{C}\}$ 
the largest depth-index in $d$ on $\mathcal{C}$. Let $\mathcal{C}[i] = \{C \in \mathcal{C} \mid d(C) = i\}$, for each $i$. 
For each $i$ from $d_{max}$ to $d_{min} + 1$, we compute the bipartite graph $B_i$ with nodes 
$\mathit{Vars}(r) \cup \mathcal{C}[i]$ and edges $\{(x, C) \mid x \in \mathit{Vars}(r),\ C \in \mathcal{C}[i],\ \exists y \in C$ s.t. $\mathrm{firstchild}(x, y) \in 
\mathit{Body}(r)$ or $\mathrm{child}(x, y) \in \mathit{Body}(r)\}$. Let $\mathcal{C}_{B_i}$ denote the set of connected components 
of $B_i$. For each $C \in \mathcal{C}_{B_i}$, merge the variables of $\mathit{Vars}(r) \cap C$ in $r$ into one.$^{11}$ Update 
$\mathcal{C}[i - 1]$ and the portions of the data structures for $G_{ns}$, $G_{ch}$ and $d$ relating to 
depth-index $i - 1$ accordingly. 

(3) For each connected component $C \in \mathcal{C}[d_{min}]$ of graph $G_{ns}$, compute a depth-index 
map $d_C : C \to \mathbb{Z}$ on the subgraph of $G_{ns}$ induced by $C$; if no such depth-index 
map exists, $r$ is unsatisfiable and we are done for $r$. For each $i$, merge the variables 
of $\{x \in C \mid d_C(x) = i\}$ in $r$ into one. We update the parts of our data structures 
relating to depth-index $d_{min}$. 

(4) We traverse the component graph $G_{ch}$ top-down, starting at the components 
with smallest index $i = d_{min}$ up to $d_{max} - 1$. For each $C \in \mathcal{C}[i]$ and each $x \in C$, we 
merge the variables $F_x = \{y \mid \mathrm{firstchild}(x, y) \in \mathit{Body}(r)\}$ into one. This can either 
be done by building a bipartite graph as in step (2) or ad-hoc, since after step (2), 
sets $F_{x_1}, F_{x_2}$ must be disjoint for $x_1 \neq x_2$. Then we simplify the “nextsibling” 
atoms of depth-index $i + 1$ as described in step (3) for depth-index $d_{min}$. 

(5) Finally, for each component $C \in \mathcal{C}$ such that there is an atom $\mathrm{child}(x, y)$, $y \in C$, 
but no atom $\mathrm{firstchild}(x, z)$, for any $z \in C$, proceed as follows. Choose precisely 
one $y \in C$ such that $\mathrm{child}(x, y) \in \mathit{Body}(r)$. If there is an atom $\mathrm{firstchild}(x, y')$, add 
$\mathrm{nextsibling}^*(y', y)$. Otherwise, add atoms $\mathrm{firstchild}(x, y_0)$ and $\mathrm{nextsibling}^*(y_0, y)$, 
where $y_0$ is a new variable. Finally, remove all “child” atoms from $r$. 

An example illustrating the rewriting technique is shown in Figure 3. In (a), the 
body of input rule r is sketched; (b) shows the rule after the completion of step (2); 
(c) after step (4), and (d) shows the final result. Merged variables are displayed as 
sets rather than as single variables to support the presentation. 

It is not difficult to verify that the described rewriting technique runs in linear 
time. Most notably, the two traversals of $G_{ch}$ (by depth-index) in steps (2) and 
(4) only change parts of the data structures pertaining to the respective current 
depth-index in each iteration and therefore only consume linear time in total. 

It is also correct. The graph of the “child” relation is a tree, so if no depth-index 
map $d$ exists for $G_{ch}$, $r$ is indeed unsatisfiable (see the related argument in 
the proof of Lemma 5.4) and can be dropped. Step (2) - in conjunction with the 
preparations of step (1) - is simply an elaborate linear-time method of “chasing” 
the functional dependency “child”: $\$2 \to \$1$ (i.e., that each node has at most 
one parent) in $r$ and simplifying $r$ accordingly. At the end of step (1) $G_{ch}$ is 
acyclic, and after step (2) $G_{ch}$ is a forest. The important observation is just that 
this functional dependency does not interfere with the others - in case we unify two 
variables when returning top-down (using the bidirectional functional dependencies 
of “nextsibling” and “firstchild” on variables “higher up” in $r$), no further variables 
can be unified using the functional dependency of “child”. 

When going top-down in steps (3) and (4), we act as if chasing the functional 
dependencies of “nextsibling” at depth-index $i$ before we merge nodes at depth $i + 1$ 
using the functional dependency “firstchild”: $\$1 \to \$2$. By proceeding in a different 
order, we might miss out on variables that could be merged. After step (4), we have 
either found $r$ to be unsatisfiable or the connected components of $G_{ns}$ have been 
transformed into linear chains and for each $C \in \mathcal{C}$ there is at most one $x \in C$ such 
that there is an $x_0$ with $\mathrm{firstchild}(x_0, x) \in \mathit{Body}(r)$. In step (5), we rewrite such a 
rule into an acyclic one, which is equivalent to the input rule from $\mathcal{P}$. □ 

Fig. 3. Translation into acyclic rule; f, c, n denote resp. “firstchild”, “child”, and “nextsibling”. 

11 Here and below, we mean by this to replace all occurrences of variables in the given set in r by 
any single variable from the set (say, the lexicographically first one), or by a new variable. 

Lemma 5.6. Every monadic datalog program $\mathcal{P}$ over $\tau_{ur} \cup \{\mathrm{child}, \mathrm{lastchild}\}$ can 
be rewritten in time $O(|\mathcal{P}|)$ into an equivalent program over $\tau_{ur} \cup \{\mathrm{nextsibling}^*\}$ in 
which each rule is acyclic. 

Proof. We replace each occurrence of an atom $\mathrm{lastchild}(x, y)$ in $\mathcal{P}$ by $\mathrm{child}(x, y)$, 
$\mathrm{lastsibling}(y)$ and employ Lemma 5.5 to obtain a program $\mathcal{P}'$ in which each rule 
is acyclic, in which we replace each atom $\mathrm{lastsibling}(x)$ by $\mathrm{lastchild}(x_0, x)$ ($x_0$ is a 
new variable). Correctness and linear runtime are easy to verify. □ 

Note that the purpose of the previous three lemmata is not to detect all unsatisfiable 
rules or to minimize rules, just to render them acyclic. (And indeed, our 
superficial treatment of “lastchild” atoms and our disregard for unary predicates 
such as “root” and “leaf” leaves many opportunities for further optimization.) 

The following algorithm decomposes acyclic rules into ones that contain at most 
a single binary atom in the body. 

Lemma 5.7. Let $\mathcal{P}$ be a monadic datalog program over finite structures $\sigma$ 
consisting only of unary and binary relations in which each rule is acyclic. Then, $\mathcal{P}$ 
can be rewritten in time $O(|\mathcal{P}|)$ into an equivalent monadic Datalog LIT program. 

Proof. For a rule $r$, we call a variable $x$ an ear of $r$ iff $x$ occurs in precisely one 
binary atom of $\mathit{Body}(r)$. 

Given a monadic datalog program $\mathcal{P}$ over arbitrary unary and binary predicates, 
we apply the following steps as long as there is at least one rule $r \in \mathcal{P}$ with head 
variable $q$ that has an ear $x \neq q$: Let $S_r(x) = \{P_1(x), \dots, P_m(x), R(x, x')\}$ be the 
set of all atoms over $x$ in $r$. Since $x$ is an ear, there is only one binary 
atom containing $x$; all other atoms in $S_r(x)$ are unary. (If the binary atom linking 
$x$ and $x'$ in the query graph of $r$ is actually of the form $R_0(x', x)$, let $R = R_0^{-1}$.) 
Remove all atoms of $S_r(x)$ from $r$ and insert $(r, x).R(x')$ instead, where $(r, x).R$ is 
a new predicate. Add a new rule with $(r, x).R(x')$ as head and $S_r(x)$ as body. 

Clearly, the program computed by this procedure is equivalent to the input 
program. It can also be easily made to run in linear time. On its termination, each 
rule in the output is in monadic Datalog LIT. □ 
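
The ear-removal loop of this proof can be sketched for a single rule as follows. The Python code is an illustration only: the function name `remove_ears`, the tuple encoding of atoms, and the auxiliary predicate names `aux1`, `aux2`, ... (standing in for the new predicates $(r, x).R$) are assumptions made here, and the simple rescanning loop is quadratic rather than linear.

```python
def remove_ears(head, body):
    """Split one acyclic rule into rules with at most one binary atom each.

    head: (predicate, variable)
    body: list of atoms (predicate, args) where args is a 1- or 2-tuple
    Returns a list of (head, body) rules; the rewritten input rule is last.
    """
    rules = []
    body = list(body)
    counter = 0
    while True:
        ear = None
        variables = sorted({v for _, args in body for v in args})
        for x in variables:
            if x == head[1]:
                continue              # the head variable is never removed
            binaries = [atom for atom in body if len(atom[1]) == 2 and x in atom[1]]
            if len(binaries) == 1 and binaries[0][1][0] != binaries[0][1][1]:
                ear = (x, binaries[0])
                break
        if ear is None:
            break
        x, (_rel, (a, b)) = ear
        other = b if a == x else a    # the neighbour x' of the ear
        s_r = [atom for atom in body if x in atom[1]]       # S_r(x)
        body = [atom for atom in body if x not in atom[1]]
        counter += 1
        new_pred = f"aux{counter}"    # hypothetical name for (r, x).R
        body.append((new_pred, (other,)))
        rules.append(((new_pred, other), s_r))
    rules.append((head, body))
    return rules
```

Applied to p(x) ← a(x), R(x, y), b(y), S(y, z), c(z), the sketch first removes the ear z, then y, and leaves p(x) ← a(x), aux2(x) together with two auxiliary rules.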

Lemma 5.8. Let $r$ be an acyclic monadic datalog rule over relations that are 
either unary or binary. Then, $r$ can be decomposed in linear time into a monadic 
datalog program in which each rule is of one of the three forms 

$p(x_1) \leftarrow p_1(x_1),\ p_2(x_2).$   $p(x) \leftarrow p_0(x_0),\ R(x_0, x).$   $p(x) \leftarrow p_0(x_0),\ R(x, x_0).$ 

where $x_1$ ($p_1$) may but does not have to be different from $x_2$ ($p_2$). 

Proof. Little postprocessing of the output of the algorithm of the proof of 
Lemma 5.7 is needed to decompose $r$ into rules of these three forms. All we need 
to do is - in case $|\mathit{Body}(r)| > 2$ - to replace pairs $p_1(x), p_2(y)$ of unary atoms in $r$ 
(where $y$ either does not appear elsewhere in $r$ or $x = y$) by an atom $p(x)$ (where 
$p$ is a new predicate) and add the rule $p(x) \leftarrow p_1(x),\ p_2(y).$ to the output. □ 

Lemma 5.9. Let $\Gamma$ be a set of binary relations and let $p$ be a unary predicate. 
Given a caterpillar expression $E$ over $\Gamma$, there is an $O(|E|)$ time algorithm for 
computing a monadic datalog program over $\Gamma$ that defines the unary predicate 

$p.E := \{x \mid (\exists x_0)\ p(x_0) \text{ is true and } (x_0, x) \in [\![E]\!]\}.$ 

Proof. By Proposition 2.4, we may assume w.l.o.g. that $E$ is syntactically a 
regular expression over the alphabet $\Gamma \cup \{R^{-1} \mid R \in \Gamma\}$. It is well known that 
each regular expression can be translated in linear time into an equivalent 
nondeterministic finite automaton with $\epsilon$-transitions $A_E = (Q, s, \delta, F)$ (cf. [Hopcroft and 
Ullman 1979]). Let $\Gamma_1$ denote the unary and $\Gamma_2$ the binary relations of $\Gamma$. It is easy 
to see that the monadic datalog program 

$\{s(x) \leftarrow p(x).\}\ \cup$ 
$\{q_2(x) \leftarrow q_1(x). \mid (q_1, \epsilon, q_2) \in \delta\}\ \cup$ 
$\{q_2(x) \leftarrow q_1(x_0),\ r(x_0, x). \mid (q_1, r, q_2) \in \delta,\ r \in \Gamma_2\}\ \cup$ 
$\{q_2(x) \leftarrow q_1(x_0),\ r(x, x_0). \mid (q_1, r^{-1}, q_2) \in \delta,\ r \in \Gamma_2\}\ \cup$ 
$\{q_2(x) \leftarrow q_1(x),\ p(x). \mid (q_1, p, q_2) \in \delta \text{ or } (q_1, p^{-1}, q_2) \in \delta,\ p \in \Gamma_1\}\ \cup$ 
$\{p.E(x) \leftarrow q_f(x). \mid q_f \in F\}$ 

can be computed in linear time. The idea employed in the encoding is reminiscent 
of Yannakakis’ semi-join-based algorithm for evaluating acyclic conjunctive queries 
[Yannakakis 1981] and indeed defines $p.E$ on the basis of $A_E$. □ 

Clearly, the techniques of the proofs of Lemma 5.7 and Lemma 5.9 are reminiscent 
of long-known results on the evaluation of acyclic conjunctive queries (cf. 
[Yannakakis 1981; Abiteboul et al. 1995]). However, our notion of acyclicity used 
for rules is more restrictive and tailored towards the class of rules produced by 
Lemma 5.4 and Lemma 5.5. 

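Example 5.10. The relation “child” is definable by the regular path expression 
firstchild.nextsibling* over $\tau_{ur}$. A (deterministic) finite automaton for “child” is 

[Figure (diagram not reproduced): start state $q_1$, a firstchild-transition from $q_1$ to $q_2$, and a nextsibling-loop on $q_2$.] 

Our monadic datalog representation of $p.\mathrm{child}$ is 

$q_1(x) \leftarrow p(x).$   $q_2(x) \leftarrow q_1(x_0),\ \mathrm{firstchild}(x_0, x).$ 

$p.\mathrm{child}(x) \leftarrow q_2(x).$   $q_2(x) \leftarrow q_2(x_0),\ \mathrm{nextsibling}(x_0, x).$ □ 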
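
The translation from an automaton to datalog rules is mechanical, which the following Python sketch tries to illustrate. It is an assumption-laden sketch and not the authors' implementation: the function name `caterpillar_rules`, the textual rule syntax with `<-`, the use of the empty string for ε-transitions, and the suffix `^-1` for inverse relations are all conventions chosen here.

```python
def caterpillar_rules(start, transitions, finals, p, head, unary=frozenset()):
    """Emit one datalog rule per automaton transition, in linear time.

    start:       name of the start state
    transitions: list of (q1, label, q2); label "" means epsilon,
                 a trailing "^-1" marks the inverse of a relation
    finals:      set of final state names
    p:           name of the unary predicate the walk starts from
    head:        name of the defined predicate, e.g. "p.child"
    unary:       names of unary relations (node tests)
    """
    rules = [f"{start}(x) <- {p}(x)."]
    for q1, label, q2 in transitions:
        inverse = label.endswith("^-1")
        name = label[:-3] if inverse else label
        if name == "":
            rules.append(f"{q2}(x) <- {q1}(x).")              # epsilon move
        elif name in unary:
            rules.append(f"{q2}(x) <- {q1}(x), {name}(x).")   # node test
        elif inverse:
            rules.append(f"{q2}(x) <- {q1}(x0), {name}(x, x0).")
        else:
            rules.append(f"{q2}(x) <- {q1}(x0), {name}(x0, x).")
    for qf in sorted(finals):
        rules.append(f"{head}(x) <- {qf}(x).")
    return rules


# The automaton of Example 5.10 for firstchild.nextsibling*.
CHILD_AUTOMATON = [("q1", "firstchild", "q2"), ("q2", "nextsibling", "q2")]
```

Calling `caterpillar_rules("q1", CHILD_AUTOMATON, {"q2"}, "p", "p.child")` yields the four rules of Example 5.10 in textual form.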

We are now in the position to prove the main theorem of this section. 

Proof of Theorem 5.2. We first apply Lemma 5.4 (for $\tau_{rk}$) or Lemma 5.6 (for 
$\tau_{ur} \cup \{\mathrm{child}, \mathrm{lastchild}\}$) to obtain an acyclic program $\mathcal{P}'$ from the input program $\mathcal{P}$. 
Next, we rewrite each rule $r$ of $\mathcal{P}'$ into an equivalent rule in which the query graph 
is connected. For instance, a rule $p(x) \leftarrow p_1(x),\ p_2(y).$ with distinct variables $x$ 
and $y$ is rewritten into rule $p(x) \leftarrow p_1(x),\ E(x, y),\ p_2(y).$ where $E$ is the caterpillar 
expression $(\prec \mid \epsilon \mid \prec^{-1})$ and $\prec$ is the document order relation (cf. Example 2.5). 
Then, we apply Lemma 5.8 to obtain a (connected) monadic Datalog LIT program 
with at most two body atoms in each rule and in which all rules are connected. 
(The transformation used in Lemma 5.8 preserves connectedness; given a rule that 
is connected as input, the output rules are connected as well.) This is already our 
TMNF normal form syntax. Finally, we eliminate caterpillar expressions from the 
program using the technique from Lemma 5.9. As is easy to verify, the rewriting 
technique of Lemma 5.9 only produces TMNF rules. □ 

Remark 5.11. As shown in the proof of Lemma 5.9, TMNF programs containing
at most one intensional predicate in each rule body are sufficient to encode
caterpillar expressions relative to, say, the root node. Caterpillar expressions
correspond in expressive power to tree-local languages and tree-walking automata and
are conjectured to capture only a proper subset of the regular tree languages (cf.
[Neven 2002; Brüggemann-Klein and Wood 2000]). The nonexistence of a more
restrictive normal form than TMNF (where in rules of form (3) the predicates $p_i$
must be from $\tau_{rk}$ or $\tau_{ur}$) thus depends on the widely held (but as of yet unproven)
conjecture that tree-walking automata are less expressive than MSO over trees. □

We conclude this section with a simple result, whose relevance is due to the
relationship between caterpillar expressions and XPath queries [World Wide Web
Consortium 1999]. The containment problem for XPath is currently being actively 
investigated (e.g. [Neven and Schwentick 2003; Miklau and Suciu 2002]). 

We call a single-rule program $\{Q(x) \leftarrow \mathrm{root}.E(x).\}$, where $E$ is a caterpillar
expression over $\tau_{rk}$ or $\tau_{ur}$, a unary caterpillar query. Let $Q_1$ and $Q_2$ be unary
caterpillar queries. $Q_1$ is called contained in $Q_2$ iff the result for $Q_1$ is contained in
the result for $Q_2$ on all trees.



Corollary 5.12. For unary caterpillar queries, the containment problem is
PSPACE-complete.

Proof. The construction of the proof of Lemma 5.9 only uses monadic linear 
datalog (that is, where each rule contains at most one intensional predicate in the 
body), for which it is known that the containment problem is PSPACE-complete 
[Cosmadakis et al. 1988]. Membership of our containment problem in PSPACE 
follows. PSPACE-hardness follows by a straightforward reduction of the PSPACE-hard
containment problem for regular expressions (on words) to this problem. □

6. VISUAL TREE WRAPPING: THE ELOG LANGUAGE 

We now make a bridging step from the main topic of this article so far, monadic 
datalog over trees, to extracting information from parse trees of Web documents. 

So far we have only shown how to define unary queries in monadic datalog, but 
will now briefly sketch the definition of wrappers. In our framework, a wrapper is 
defined as a set of unary queries, “information extraction functions”, that select 
tree nodes. A monadic datalog program can compute a set of such queries at 
once. Each intensional predicate of a program selects a subset of dom and can be 
considered to define one information extraction function. 

Given a set of information extraction functions, one natural way to wrap an input
tree $t$ is to compute a new label for each node $n$ (or filter out $n$) as a function of the
predicates assigned using the information extraction functions. The output tree is
computed by connecting the resulting labeled nodes using (the transitive closure
of) the edge relation of $t$, preserving the document order of $t$. We do not formalize
this operation here, since the natural construction is straightforward.
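The wrapping operation just described can be made concrete in a few lines. The sketch below is one possible reading, not a formalization from the paper: the `Node` class, the `assigned` map from predicate names to node identifiers, and the `relabel` callback are all hypothetical names. A node for which `relabel` returns `None` is filtered out and its surviving descendants are attached to the nearest surviving ancestor, which corresponds to following the transitive closure of the edge relation. If the root itself is filtered out, the result is a list of trees.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List, Optional, Set


@dataclass
class Node:
    ident: int                      # hypothetical node identifier
    label: str
    children: List["Node"] = field(default_factory=list)


def wrap(node: Node,
         assigned: Dict[str, Set[int]],
         relabel: Callable[[FrozenSet[str]], Optional[str]]) -> List[Node]:
    """Return the output subtrees produced for `node`, in document order."""
    # Children are processed left to right, so document order is preserved.
    kept_below = [out for child in node.children
                  for out in wrap(child, assigned, relabel)]
    preds = frozenset(p for p, nodes in assigned.items() if node.ident in nodes)
    new_label = relabel(preds)
    if new_label is None:
        return kept_below           # node is filtered out, descendants move up
    return [Node(node.ident, new_label, kept_below)]


def shape(node: Node):
    """Nested (label, children) view of a tree, convenient for inspection."""
    return (node.label, [shape(c) for c in node.children])


def pick_label(preds: FrozenSet[str]) -> Optional[str]:
    """Example labeling function: keep 'record' and 'price' nodes only."""
    for name in ("record", "price"):
        if name in preds:
            return name
    return None


doc = Node(1, "html", [Node(2, "table", [
    Node(3, "tr", [Node(4, "td"), Node(5, "td")]),
    Node(6, "tr", [Node(7, "td")]),
])])
assigned = {"record": {3, 6}, "price": {5, 7}}
result = wrap(doc, assigned, pick_label)
```

In this example the `html`, `table`, and unselected `td` nodes disappear, and each `price` node ends up directly below its `record` node.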

6.1 Monadic Datalog as a Wrapper Programming Language 

In the previous section, we have shown that monadic datalog has the expressive 
power of our yardstick MSO (on trees), can be evaluated efficiently, and is a good 
(easy to use) wrapper programming language. Indeed, 

—The existence of the normal form TMNF of Section 5 demonstrates that rules in 
monadic datalog never have to be long or intricate. 

—The monotone semantics makes the wrapper programming task quite modular 
and intuitive. Differently from an automaton definition that usually has to be 
understood entirely to be certain of its correctness, adding a rule to a monadic 
datalog program usually does not change its meaning completely, but adds to 
the functionality. 

—Handling unranked trees is a necessity in wrapping Web documents. The use
of the signature $\tau_{ur}$ (or even $\tau_{ur} \cup \{\mathrm{child}\}$) with monadic datalog introduces no
notational difficulties. Working on unranked trees is just as simple as working
on ranked trees.

—Wrappers defined in monadic datalog only need to specify queries, rather than the 
full source trees on which they run. This is very important to practical wrapping, 
because this way changes in parts of documents not immediately relevant to the 
objects to be extracted do not break the wrapper. 
This property of monadic datalog programs is shared with the wrapping languages
of the implemented tree-based wrapping systems [Sahuguet and Azavant
2001; Liu et al. 2000; Baumgartner et al. 2001a], but not by query automata 
or attribute grammars (or string-based wrapping frameworks, for that matter). 
Unary queries in monadic datalog are less work-intensive to define than their 
query automata or attribute grammar counterparts in the first place, and are 
subsequently less costly to maintain. 

Only one of the four desiderata from the introduction remains to be addressed, 
the visual specification of wrappers. In the remainder of this section, we introduce 
a framework for satisfying it which is based on the existing wrapping language Elog. 

Elog programs can be completely visually specified. The fragment Elog$^-$ presented
below is closely related to monadic datalog over trees and allows expressing
precisely the same unary queries. Thus, the capability of specifying unary queries
entirely visually is also inherited by MSO.

6.2 Visual Wrapper Specification 

As discussed in the introduction, by visual wrapper specification, we refer to the
process of interactively defining a wrapper from a few example documents, ideally
using mainly “mouse clicks”.

The visual wrapping process in systems such as Lixto [Baumgartner et al. 2001a; 
2001b] heavily relies on one main operation performed by users: By marking a 
region of a Web document displayed on screen using an input device such as a 
mouse, the node in the document tree best matching the selected region can be 
robustly determined. By selecting a reference region followed by a second region 
inside the former, it is possible to define a fixed path $\pi$ in an example document.
We introduce a special predicate for checking such paths. 

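Definition 6.1. Let $\Sigma$ be an alphabet not containing “_”. For strings $\pi \in (\Sigma \cup \{\_\})^*$,
the predicate $\mathrm{subelem}_\pi$ is defined inductively as follows:

$\mathrm{subelem}_{\epsilon}(x, y) := x = y.$
$\mathrm{subelem}_{\_.\pi}(x, y) := \mathrm{child}(x, z), \mathrm{subelem}_{\pi}(z, y).$
$\mathrm{subelem}_{a.\pi}(x, y) := \mathrm{child}(x, z), \mathrm{label}_a(z), \mathrm{subelem}_{\pi}(z, y).$

□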


The symbol “_” thus is a wildcard matching any symbol and allows generalizing
from visually gathered paths. Note that the definition of subelem is nonrecursive
and for each path $\pi$, $\mathrm{subelem}_\pi$ is defined through a fixed conjunction of child and
label atoms. (Theorem 5.2 showed how to eliminate child atoms to obtain programs
strictly over $\tau_{ur}$.) The term $x = y$ is not an atom. We assume that when we
encounter it while rewriting a subelem atom into a set of monadic datalog atoms
over $\tau_{ur}$, we replace each occurrence of variable $y$ in the rule by $x$. For example,
$\mathrm{subelem}_{a.b}(x, y)$ is a shortcut for $\mathrm{child}(x, z), \mathrm{label}_a(z), \mathrm{child}(z, y), \mathrm{label}_b(y)$, where
$z$ is a new variable.
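The unfolding of a subelem atom into child and label atoms is mechanical, as the following sketch illustrates. The function name `expand_subelem`, the textual atom format, and the fresh variable names `z1`, `z2`, ... are assumptions made for this illustration. For the empty path the function returns no atoms and reports `x` as the end variable, which mirrors the convention of replacing `y` by `x` in the rule.

```python
from typing import List, Tuple

WILDCARD = "_"


def expand_subelem(path: List[str], x: str = "x", y: str = "y") -> Tuple[List[str], str]:
    """Unfold subelem_path(x, y) into child/label atoms, following Definition 6.1.

    Returns the list of atoms and the variable that denotes the end node.
    Intermediate variables are named z1, z2, ... and are assumed to be unused
    elsewhere in the rule.
    """
    atoms: List[str] = []
    current = x
    for i, symbol in enumerate(path):
        # The last step ends in y; earlier steps introduce a fresh variable.
        nxt = y if i == len(path) - 1 else f"z{i + 1}"
        atoms.append(f"child({current}, {nxt})")
        if symbol != WILDCARD:          # the wildcard imposes no label condition
            atoms.append(f"label_{symbol}({nxt})")
        current = nxt
    return atoms, current
```

For the path `a.b` this yields `child(x, z1), label_a(z1), child(z1, y), label_b(y)`, matching the example in the text up to the name of the fresh variable.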

Subsequently, we refer to monadic intensional predicates as pattern predicates or 
just patterns. Patterns are a useful metaphor for the building blocks of wrappers. 

Given an example document representative for a family of documents to be 
wrapped, a user may be guided in the graphical specification of a rule as follows. 
-First, a destination pattern $p$ is named (which may be new) and a parent pattern
$p_0$ is selected from among the patterns defined so far. Initially, the only pattern
available is the “root” pattern.

The “root” pattern corresponds to the extensional predicate root of $\tau_{ur}$ and is
the only exception to the correspondence of patterns and intensional predicates.
The system can then display the document and highlight those regions in it which
correspond to nodes in its parse tree that are classified $p_0$ using the wrapper
program specified so far.

-A new rule is defined by selecting, with a few mouse clicks over the example
document, a subregion of one of those highlighted. The system can automatically
decide which path $\pi$ relative to the highlighted region best describes the region
selected by the user.

-The rule $p(x) \leftarrow p_0(x_0), \mathrm{subelem}_\pi(x_0, x).$ obtained in this way can then be
refined by generalizing the path or adding conditions. These tasks can be carried
out visually as well (see [Baumgartner et al. 2001a]).
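The steps above can be sketched as follows. This is only an illustration of how a path and a rule might be derived from two selected nodes; it is not a description of how Lixto is implemented. The parent map, the label map, and the functions `path_between`, `generalize`, and `rule_text` are hypothetical.

```python
from typing import Dict, List, Optional, Set


def path_between(parent: Dict[str, Optional[str]], label: Dict[str, str],
                 ancestor: str, selected: str) -> Optional[List[str]]:
    """Labels on the way from `ancestor` (exclusive) down to `selected` (inclusive).

    Returns None if `selected` does not lie inside the region of `ancestor`.
    """
    path: List[str] = []
    node: Optional[str] = selected
    while node is not None and node != ancestor:
        path.append(label[node])
        node = parent[node]
    if node is None:
        return None
    path.reverse()
    return path


def generalize(path: List[str], positions: Set[int]) -> List[str]:
    """Replace the symbols at the given positions by the wildcard '_'."""
    return ["_" if i in positions else s for i, s in enumerate(path)]


def rule_text(p: str, p0: str, path: List[str]) -> str:
    """Textual form of the rule p(x) <- p0(x0), subelem_path(x0, x)."""
    pi = ".".join(path) if path else "eps"   # 'eps' stands for the empty path
    return f"{p}(x) <- {p0}(x0), subelem_{pi}(x0, x)."


# Hypothetical example document: html > body > (table > tr, p)
parent = {"n0": None, "n1": "n0", "n2": "n1", "n3": "n2", "n4": "n1"}
label = {"n0": "html", "n1": "body", "n2": "table", "n3": "tr", "n4": "p"}
```

Selecting the region of node `n3` inside the highlighted region of `n1` gives the path `table.tr`; generalizing its first position gives `_.tr`.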

Very few example documents are needed for defining a wrapper program: It is 
only required that for each rule to be specified, there exists a document in which an 
instance of the parent pattern can be recognized and an instance of the destination 
pattern relates to it in the desired manner. 

The process outlined is used in the Lixto system and is described in more detail 
in [Baumgartner et al. 2001b; 2001a], where many examples and screenshots are 
dedicated to the visual specification process. 

6.3 The Core Fragment: Elog$^-$

In the remainder of this section, we introduce various simplified fragments of the 
wrapping language Elog presented in [Baumgartner et al. 2001b; 2001a]. By these 
simplifications we obtain wrapping languages whose theoretical aspects are simpler 
to study. Certain redundancies and artifacts of the Elog language are neither eliminated
nor discussed in great detail here; they witness Elog’s lineage as a practical
language that has grown over time. 

We start with the wrapping language Elog$^-$, which is basically a fragment of
monadic datalog over trees. Later, we add some sophistication to the way in which
trees can be extracted, and define the fragment Elog$^-_2$, which uses a very restricted
kind of binary intensional predicates that allow certain nodes of the input
tree to be skipped in the wrapping process. While Elog$^-_2$ slightly extends the supported builtin
predicates as compared to Elog$^-$, both fragments are just as expressive as MSO in
their power to define unary queries.

Definition 6.2. Let $\Pi = (\Sigma \cup \{\_\})^*$ denote our language of fixed paths. The
language Elog$^-$ is a fragment of monadic datalog over

$\langle \mathrm{root}, \mathrm{leaf}, \mathrm{firstsibling}, \mathrm{nextsibling}, \mathrm{lastsibling}, (\mathrm{subelem}_\pi)_{\pi \in \Pi}, (\mathrm{contains}_\pi)_{\pi \in \Pi} \rangle$

where “root”, “leaf”, “nextsibling”, and “lastsibling” are as in $\tau_{ur}$, “firstsibling”
has the intuitive meaning symmetric to “lastsibling”, “$\mathrm{subelem}_\pi$” was defined in
Definition 6.1, and “$\mathrm{contains}_\pi$” is equivalent to “$\mathrm{subelem}_\pi$”, except that $\epsilon$-paths must
not be used. The predicates “leaf”, “firstsibling”, “nextsibling”, “lastsibling”, and “contains” are
called condition predicates, and rules are restricted to the form

$p(x) \leftarrow p_0(x_0), \mathrm{subelem}_\pi(x_0, x), C, R.$

such that $p$ is a pattern predicate, $p_0$ (the so-called parent pattern) is either a
pattern predicate or “root”, $R$ (pattern references) is a possibly empty set of atoms
over pattern predicates, and $C$ is a possibly empty set of atoms over condition
predicates. Moreover, the query graph of each rule must be connected.

We may write rules of the form $p(x) \leftarrow p_0(x_0), \mathrm{subelem}_\epsilon(x_0, x), C, R.$ equivalently
as $p(x) \leftarrow p_0(x), C, R.$ and call such rules specialization rules. □

Remark 6.3. Compared to a strict fragment of Elog, this definition is simplified 
in several respects. In fact, “leaf” does not exist in Elog, but can be simulated using 
stratified negation, which is supported. The “root”, “firstsibling”, and “lastsibling” 
relations are called “rootdocument”, “firstson”, and “lastson”, respectively, and 
have additional columns. Instead of “nextsibling”, Elog provides “before” and 
“after” predicates, which can be parameterized (basically by setting their distance 
tolerance arguments, which specify how far apart two matching nodes may be, to 
zero) to capture the meaning of “nextsibling”. □ 

By replacing each occurrence of the “subelem” and “contains” shortcuts by the
“child” atoms they denote (see Definition 6.1), Elog$^-$ becomes a fragment of monadic
datalog over $\tau_{ur} \cup \{\mathrm{child}\}$. By Theorems 5.2 and 4.2, monadic datalog over
$\tau_{ur} \cup \{\mathrm{child}\}$ (and thus Elog$^-$) can still be evaluated in time linear in the size
of the program and linear in the size of the data.

Corollary 6.4. An Elog$^-$ program $\mathcal{P}$ can be evaluated on a tree $t$ in time
$O(|\mathcal{P}| \cdot |\mathrm{dom}_t|)$.

As stated next, Elog$^-$ retains the wrapping power of MSO (and equally, monadic
datalog) over unranked trees.

Theorem 6.5. A set of information extraction functions is definable in monadic
datalog over $\tau_{ur}$ iff it is definable in Elog$^-$.

Proof. Of course, each wrapper expressible in Elog$^-$ is also expressible in
monadic datalog over $\tau_{ur}$. All that has to be done to translate from the first
to the second language is to eliminate all occurrences of “subelem” and “contains”
using Definition 6.1.

The other direction is more interesting. By Theorem 5.2, it suffices to show that
each program in our normal form can be defined in Elog$^-$.

This is easily possible. Monadic datalog rules that contain only unary atoms are
already correct Elog$^-$ specialization rules, with the exception of those containing
“label”. Rules containing “label”, e.g.

$p(x) \leftarrow \mathrm{label}_a(x).$

are translated into

$p(x) \leftarrow \mathrm{dom}(x_0), \mathrm{subelem}_a(x_0, x).$

A pattern “dom”, which matches any node, is easily definable using a two-rule
recursive program that ensures that the root node matches pattern “dom” and so
do all children of nodes that match “dom”. In Elog$^-$, “nextsibling” is a condition
predicate, so we rewrite normal form rules containing “nextsibling”, such as

$p(x) \leftarrow p_0(x_0), \mathrm{nextsibling}(x_0, x).$

into specialization rules, here

$p(x) \leftarrow \mathrm{dom}(x), \mathrm{nextsibling}(x_0, x), p_0(x_0).$

In this rule, $\mathrm{dom}(x)$ is the parent pattern, $\mathrm{nextsibling}(x_0, x)$ a condition atom, and
$p_0(x_0)$ a pattern reference.

There are two cases of rules containing “firstchild”,

$p(x) \leftarrow p_0(x_0), \mathrm{firstchild}(x_0, x).$ and $p(x) \leftarrow p_0(y), \mathrm{firstchild}(x, y).$

The second is interesting because we want to infer patterns upward in the tree and
“subelem” predicates can only be used downward. We rewrite the rule into

$p(x) \leftarrow \mathrm{dom}(x), \mathrm{contains}_{\_}(x, y), \mathrm{firstsibling}(y), p_0(y).$

using a specialization rule in conjunction with a “contains” atom. □
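The upward rewriting in the proof can be checked on a small tree. The sketch below evaluates the “dom” pattern and compares the original firstchild rule with its rewritten form. The reading of the two-rule “dom” program given in the docstring, the tree encoding, and all function names are assumptions made for illustration.

```python
from typing import Dict, List, Set


def dom_pattern(root: str, children: Dict[str, List[str]]) -> Set[str]:
    """One reading of the two-rule program for 'dom':
    dom(x) <- root(x).   dom(x) <- dom(x0), subelem__(x0, x).
    """
    dom = {root}
    frontier = [root]
    while frontier:
        node = frontier.pop()
        for child in children.get(node, []):
            if child not in dom:
                dom.add(child)
                frontier.append(child)
    return dom


def upward_via_firstchild(p0: Set[str], children: Dict[str, List[str]]) -> Set[str]:
    """p(x) <- p0(y), firstchild(x, y)."""
    return {x for x, kids in children.items() if kids and kids[0] in p0}


def upward_via_elog(p0: Set[str], root: str,
                    children: Dict[str, List[str]]) -> Set[str]:
    """p(x) <- dom(x), contains__(x, y), firstsibling(y), p0(y)."""
    result: Set[str] = set()
    for x in dom_pattern(root, children):
        for i, y in enumerate(children.get(x, [])):  # contains__(x, y): y is a child of x
            if i == 0 and y in p0:                   # firstsibling(y) and p0(y)
                result.add(x)
    return result


# Hypothetical example tree with root 'r'.
tree = {"r": ["a", "b"], "a": ["c", "d"], "b": [], "c": [], "d": []}
```

On this tree both formulations select the same nodes for every choice of `p0`, for example `{"r"}` for `p0 = {"a", "d"}`.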

Note at this point that the full Elog language of [Baumgartner et al. 2001a] is
strictly more expressive than MSO.$^{12}$ For example, Elog supports so-called distance
tolerances in “before” and “after” predicates. Let Elog$^-_\Delta$ be the new language
obtained from Elog$^-$ by extending its “before” predicate by a distance tolerance,
which is a pair of percentage values such that whenever $x_0$ refers to a node with
$k$ children, $\mathrm{before}_{\pi, \alpha\%-\beta\%}(x_0, x, y)$ requires that among the nodes reachable from
node $x_0$ via path $\pi \in \Sigma^*$, $x$ is at least $k \cdot \frac{\alpha}{100}$ and at most $k \cdot \frac{\beta}{100}$ before $y$. An
Elog atom $\mathrm{notafter}_\pi(x, y)$ (resp., $\mathrm{notbefore}_\pi(x, y)$) is true if node $y$ does not occur
after (resp., before) a node reachable from node $x$ via path $\pi \in \Sigma^*$ in the document
(w.r.t. document order).

Theorem 6.6. The Elog$^-_\Delta$ language is strictly more expressive than unary MSO
queries over unranked trees.

Proof. Consider the Elog$^-_\Delta$ program $\mathcal{P}$

$a_0(x) \leftarrow \mathrm{root}(x_0), \mathrm{subelem}_a(x_0, x), \mathrm{notafter}_a(x_0, x).$
$b_0(x) \leftarrow \mathrm{root}(x_0), \mathrm{subelem}_b(x_0, x), \mathrm{notafter}_b(x_0, x), \mathrm{notbefore}_a(x_0, x).$
$a^nb^n(x) \leftarrow \mathrm{root}(x), \mathrm{contains}_a(x, y), a_0(y), \mathrm{before}_{b, 50\%-50\%}(x, y, z), b_0(z).$

over $\Sigma = \{a, b\}$.

The leftmost children of the root node labeled $a$ and $b$ are assigned the predicates
$a_0$ and $b_0$, respectively, if in addition there is no node labeled $a$ to the right of the
node assigned $b_0$. If both $a_0$ and $b_0$ are assigned to nodes, the labels of the children
of the root node read from left to right must constitute a word $a^nb^m$. Let the root
node have $k$ children. The root node is assigned $a^nb^n$ if there are two children $n_1$
and $n_2$ labeled $a_0$ and $b_0$, respectively, such that $n_2$ is $k/2$ nodes to the right of $n_1$
among the children of the root node. Thus, $\mathcal{P}$ classifies the root node as $a^nb^n$ if
and only if its list of children is of the same form. However, it is well known that
the word language $\{a^nb^n \mid n \geq 1\}$ is not regular, so neither is the tree language
consisting of all trees $t$ whose root node is classified as $a^nb^n$ by $\mathcal{P}$. □

12 Full Elog supports Web crawling, stratified negation, so-called distance tolerances in “before”
and “after” atoms, and tree region extraction, all features missing from the fragments discussed
here. Presenting these features in detail is beyond the scope of this paper, but a detailed overview
of the full Elog language is given in [Baumgartner et al. 2001b; 2001a].
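The effect of the three rules in this proof can be simulated directly on the sequence of child labels of the root. The sketch below is an illustration under stated assumptions: the alphabet is exactly {a, b}, the 50% to 50% distance tolerance is read as "exactly k/2 positions to the right", and the function names are hypothetical.

```python
from typing import List, Optional


def leftmost(labels: List[str], symbol: str) -> Optional[int]:
    """Position of the leftmost child carrying `symbol`, or None if absent."""
    return labels.index(symbol) if symbol in labels else None


def classify_anbn(labels: List[str]) -> bool:
    """Decide whether the root would be classified a^n b^n, given the labels
    of its children from left to right (alphabet assumed to be {a, b})."""
    k = len(labels)
    i = leftmost(labels, "a")      # a_0: leftmost child labeled a
    j = leftmost(labels, "b")      # candidate for b_0: leftmost child labeled b
    if i is None or j is None:
        return False
    if "a" in labels[j + 1:]:      # notbefore_a: no a to the right of b_0
        return False
    # before with tolerance 50%-50%: b_0 lies exactly k/2 positions after a_0
    return 2 * (j - i) == k
```

Since the check counts positions relative to the total number of children, it accepts exactly the words of the form a^n b^n with n at least 1, which no finite automaton over the child sequence can do.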

7. SUMMARY AND CONCLUSIONS 

We studied the expressiveness and complexity of monadic datalog over trees and 
the core fragment of its close relative, the practical wrapper programming language 
Elog. We showed that the expressive power of both languages is precisely that of 
the unary MSO queries. As a significant by-product which may be useful in future 
investigations, we discovered a simple normal form for monadic datalog over trees, 
TMNF, to which every program can be translated in linear time. 

In summary, we have studied a significant new practical application of logic
(programming) to information systems from a theoretical perspective. The database
programming language datalog, which has received considerable attention from the
database theory community over many years (see e.g. [Abiteboul et al. 1995]) but
has ultimately failed to attract a large following in database practice, might thus
experience a notable “rebirth” in the context of trees and the Web. Indeed, for
datalog as a framework for selecting nodes from trees, the situation is substantially
different from the general case of full datalog on arbitrary databases. Monadic
datalog over trees has very low evaluation complexity, programs have a simple normal
form, so rules never have to be long or intricate, and various automata-theoretic,
language-theoretic, and logical techniques exist for evaluating programs or optimizing
them which are not available for full datalog.

As a final remark, monadic datalog also has applications in querying XML and 
checking the conformance of XML documents to DTDs and regular tree languages.
Indeed, Core XPath [Gottlob et al. 2002], the logical core fragment of the popular 
XPath language, can be mapped efficiently to monadic datalog [Gottlob and Koch 
2002b; Frick et al. 2003] and thus inherits its very favorable worst-case evaluation 
complexity bounds. 